I have an html document formatted this way: <p> some plain text <em>some emphatized

Question


Asked: June 11, 2026 (2026-06-11T00:13:09+00:00)


I have an HTML document formatted this way:

<p>
 some plain text <em>some emphatized text</em>, <strong> some strong text</strong>
</p>
<p>
 just some plain text
</p>
<p>
  <strong>strong text </strong> followed by plain, <a>with a link at the end!</a>
</p>

I’d like to extract the text. With a DOM-like parser I could extract each paragraph, but the problem is what is inside: I’d have to extract the text from the inner tags too and end up with a single string in the same order. In the example above, for the first paragraph, I want to extract:

some plain text some emphatized text, some strong text

For this purpose I guess a SAX-like parser would be better than a DOM one, given that I can’t know the number or sequence of inner tags: a paragraph can have zero or more inner tags, of different types.


1 Answer

Editorial Team · Answer 1 · 2026-06-11T00:13:11+00:00

You can use a DOM parser: get the content of each p tag (including child HTML elements) into a string variable, then use some other function to strip all the HTML tags out of the resulting string. This should leave you with all of the content between the p tags, without any of the child element tags.

Example

<p>
    some plain text <em>some emphatized text</em>, <strong> some strong text</strong>
</p>
<p>
    just some plain text
</p>
<p>
    <strong>strong text </strong> followed by plain, <a>with a link at the end!</a>
</p>

Use a DOM parser to extract the contents of the p tags as strings. You would then have a string like this:

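String content = "some plain text <em>some emphatized text</em>, <strong> some strong text</strong>";
// One simple way to strip tags: a regex that is fine for markup like this, but is not a general HTML parser.
// The second replaceAll collapses the double space left behind where a tag was removed.
content = content.replaceAll("<[^>]+>", "").replaceAll("\\s+", " ").trim();
System.out.println(content); // some plain text some emphatized text, some strong text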
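
If the DOM parser exposes a node's text content directly, the separate tag-stripping step may not be needed at all, because the text of the inner tags comes back in document order. Below is a minimal sketch of that idea using only the JDK's built-in DOM API. It assumes the markup is well-formed enough to be parsed as XML, which real-world HTML often is not; the class name ParagraphText and the method extractParagraphs are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ParagraphText {
    // Returns the text of every <p> element, with inner tags flattened
    // in document order and runs of whitespace collapsed to one space.
    static List<String> extractParagraphs(String html) throws Exception {
        // Assumption: the fragment is well-formed XML. A synthetic root
        // element is added so several <p> siblings form one document.
        String xml = "<root>" + html + "</root>";
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        NodeList paragraphs = doc.getElementsByTagName("p");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < paragraphs.getLength(); i++) {
            // getTextContent() concatenates the text of all descendants.
            String text = paragraphs.item(i).getTextContent();
            result.add(text.replaceAll("\\s+", " ").trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String html =
            "<p> some plain text <em>some emphatized text</em>, "
            + "<strong> some strong text</strong> </p>"
            + "<p> just some plain text </p>";
        for (String paragraph : extractParagraphs(html)) {
            System.out.println(paragraph);
        }
    }
}
```

For HTML that is not valid XML (unclosed tags, entities such as &nbsp;), this sketch will throw a parse error. In that case a lenient HTML parser is probably the better fit; jsoup, for example, offers a text() method on elements that does a similar flattening.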
