I have an Android application which grabs some data from an external XML source. I’ve stripped out some HTML from one of the XML elements, but it’s in the format:
<p class="x">Some text...</p>
<p>Some more text</p>
<p>Some final text</p>
I want to extract the text of the middle paragraph. How can I do this? Would a regular expression be the best way? I don’t really want to start including external HTML parsing libraries.
See the question “RegEx match open tags except XHTML self-contained tags”.
So, I’ll ask the question that wraps up the linked-to answer: have you tried using an XML parser instead?
You might get some ideas from some of the other answers there, too, but I’d try to avoid the regex path. As Macarse suggested, clean this up on the server if you can. If not, wrap those three <p> elements in a single root element and parse the result with SAX or a similar parser, paying attention to the second paragraph element.
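A minimal sketch of that last suggestion, using only the SAX classes that ship with the platform (javax.xml.parsers and org.xml.sax). The class name SecondParagraph, the method extractSecondParagraph, and the wrapper element name root are made up for illustration. It also assumes the stripped-out fragment is well-formed XML (every tag closed, no HTML-only entities such as &nbsp;); if it isn't, the parser will throw.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SecondParagraph {

    // Hypothetical helper: returns the text of the second <p> element,
    // or null if the fragment contains fewer than two <p> elements.
    static String extractSecondParagraph(String fragment) throws Exception {
        // Wrap the sibling <p> elements in one root so the input is a single XML document.
        String xml = "<root>" + fragment + "</root>";

        final StringBuilder text = new StringBuilder();
        final int[] paragraphCount = {0};
        final boolean[] inTarget = {false};

        DefaultHandler handler = new DefaultHandler() {
            // Some parsers report the tag name in localName rather than qName,
            // so fall back to localName when qName is empty.
            private String name(String localName, String qName) {
                return (qName == null || qName.isEmpty()) ? localName : qName;
            }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("p".equals(name(localName, qName)) && ++paragraphCount[0] == 2) {
                    inTarget[0] = true;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per element, so append rather than assign.
                if (inTarget[0]) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("p".equals(name(localName, qName))) {
                    inTarget[0] = false;
                }
            }
        };

        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return paragraphCount[0] >= 2 ? text.toString() : null;
    }

    public static void main(String[] args) throws Exception {
        String fragment = "<p class=\"x\">Some text...</p>\n"
                + "<p>Some more text</p>\n"
                + "<p>Some final text</p>";
        System.out.println(extractSecondParagraph(fragment)); // prints "Some more text"
    }
}
```

Counting <p> start tags keeps the handler independent of the class attribute on the first paragraph. If the paragraphs can contain inline markup such as <b>, the characters callback still collects the nested text, since it appends everything seen until the closing </p>.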