I have an HTML document formatted this way:
<p>
some plain text <em>some emphatized text</em>, <strong> some strong text</strong>
</p>
<p>
just some plain text
</p>
<p>
<strong>strong text </strong> followed by plain, <a>with a link at the end!</a>
</p>
I’d like to extract the text. With a DOM-like parser I could extract each paragraph, but the problem is what is inside: I’d have to extract the text from the inner tags too and end up with a single string in the original order. For the first paragraph in the example above, I want to extract:
some plain text some emphatized text, some strong text
For this purpose I guess a SAX-like parser would be better than a DOM one, given that I can’t know the number or sequence of inner tags in advance: a paragraph can have zero or more inner tags, of different types.
You can use a DOM parser: get the contents of each p tag (including child HTML elements) into a string variable, then use some other function to strip all the HTML tags out of that string. This should leave you with all of the content between the p tags without any of the child element tags.
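The tag-stripping step could look roughly like the following in Python. This is only a sketch: it assumes the inner HTML of a paragraph has already been pulled out as a string by whatever DOM parser you use, and the names `TagStripper` and `strip_tags` are made up for illustration. It relies on the standard library's `html.parser.HTMLParser`, which reports text nodes in document order, so the pieces come back in the same sequence as in the source.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: keeps text nodes only, in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags, whatever the nesting.
        self.chunks.append(data)

    def text(self):
        # Collapse the whitespace runs left behind by removed tags and newlines.
        return " ".join("".join(self.chunks).split())


def strip_tags(fragment):
    """Return the text of an HTML fragment with all tags removed."""
    stripper = TagStripper()
    stripper.feed(fragment)
    stripper.close()
    return stripper.text()


# Assumed input: the inner HTML of the first paragraph from the question.
paragraph_html = (
    "some plain text <em>some emphatized text</em>, "
    "<strong> some strong text</strong>"
)
print(strip_tags(paragraph_html))
```

Because the parser ignores which tags it meets, the number and type of inner tags does not matter. Whitespace is normalized to single spaces here; drop that step if the original spacing must be preserved.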
Example
Use a DOM parser to extract each p tag as a string; you would then have a string like so: