I need help doing a few things with XPath in PHP.
With any given HTML, I need to:
- Remove all tables and their contents
- Remove everything after the first h1 tag
- Keep only paragraphs, including their inner HTML (links, lists, etc.)
With regex, I got everything working perfectly. When I encountered nested tables, however, I decided that it is indeed foolish to parse HTML with regex.
Thanks so much!
This can be done very easily with XSLT:
In case your element names are not in the XHTML namespace, simply delete every occurrence of h: in the above code.
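If you would rather stay in PHP with DOMXPath, as the question asks, the three requirements can be sketched roughly as follows. This is only a minimal sketch under some assumptions: the function name extractLeadingParagraphs is made up for illustration, "everything after the first h1" is read as "everything that follows it in document order", and the input is plain HTML without namespaces.

```php
<?php
// Hypothetical helper: the name and return shape are illustrative only.
// Returns the outer HTML of every <p> that survives the three rules.
function extractLeadingParagraphs(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid, so collect parser warnings
    // instead of printing them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Rule 1: remove all tables. Selecting only the outermost tables is
    // enough, because nested tables disappear together with their parent.
    // The node list is copied first so removal does not disturb iteration.
    $tables = iterator_to_array($xpath->query('//table[not(ancestor::table)]'));
    foreach ($tables as $table) {
        $table->parentNode->removeChild($table);
    }

    // Rules 2 and 3: keep only paragraphs that have no h1 before them in
    // document order (assumed reading of "remove everything after the
    // first h1"). saveHTML($node) serializes the paragraph with its inner
    // markup, so links and other children are preserved.
    $paragraphs = [];
    foreach ($xpath->query('//p[not(preceding::h1)]') as $p) {
        $paragraphs[] = $doc->saveHTML($p);
    }

    return $paragraphs;
}

// Usage: one paragraph before the h1, one inside a nested table, one after.
$sample = '<div><p>Keep <a href="#">this</a></p>'
    . '<table><tr><td><p>cell</p><table><tr><td>x</td></tr></table></td></tr></table>'
    . '<h1>Title</h1><p>Drop</p></div>';

$result = extractLeadingParagraphs($sample);
```

Here $result should contain only the first paragraph, link included; the paragraph inside the table and the one after the h1 are dropped. Note that loadHTML assumes ISO-8859-1 unless the markup declares a charset, so non-ASCII input may need an explicit encoding declaration.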