I have an XML file with malformed HTML in its content


Asked: May 23, 2026


I have an XML file with malformed HTML in its content.
Since an XML parser cannot handle HTML tags like <br>, I have used CDATA for saving and parsing.

While parsing, I have used documentBuilder.setCoalescing(true) to recover the data in <![CDATA[<br>test<br>data<br>]]> without the CDATA wrapper.

But in the output, the < and > characters are replaced by &lt; and &gt; respectively.

I am expecting this string in the parsed result:

<br>test<br>data<br>

How can I do this? Any ideas?
Thanks in advance!

UPDATE: I have two follow-up questions.

1. Is there any way to convert malformed HTML (e.g. <br>) into parsable XML (e.g. <br/>) via code? If so, will it handle &nbsp; also?

2. Is there any way to convert HTML text to plain text in Java (e.g. <div>test text</div> to test text)?


1 Answer

Editorial Team · Answer 1 · 2026-05-23T02:27:44+00:00

Coalescing is an operation in which the contents of CDATA sections (nodes) are converted to text nodes and merged with the contents of adjacent text nodes. Converting CDATA sections to text nodes imposes the restriction that the resulting text nodes be composed of valid XML characters. This preserves the original document formatting; in other words, the structure of the nodes in the original document does not change.

The resulting behavior is that, of the five predefined entities (&lt;, &gt;, &amp;, &quot; and &apos;), the first three will be expanded, because their unaltered presence would change the document structure.
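As a rough illustration of where the escaped form shows up, the sketch below parses the sample from the question with coalescing enabled and then writes the document back out. It uses only the JDK. The class name CoalescingDemo, the <root> wrapper element, and the use of a Transformer for output are assumptions made for this example and are not taken from the question. Note that in the JDK, setCoalescing is a method of DocumentBuilderFactory.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CoalescingDemo {
    // Assumed sample: the CDATA section from the question inside a made-up <root> element.
    static final String SAMPLE = "<root><![CDATA[<br>test<br>data<br>]]></root>";

    // Parses the XML with coalescing on, so CDATA sections become plain text nodes.
    static Document parseCoalesced(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setCoalescing(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample", e);
        }
    }

    // In-memory text of the root element after coalescing.
    static String coalescedText(String xml) {
        return parseCoalesced(xml).getDocumentElement().getTextContent();
    }

    // Serialized form of the same document, written with the default Transformer.
    static String serialized(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(parseCoalesced(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize sample", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("text content: " + coalescedText(SAMPLE));
        System.out.println("serialized:   " + serialized(SAMPLE));
    }
}
```

Comparing the two printed lines in your own setup should show whether the escaped characters come from the node value itself or only from writing the document back out. If the value you end up holding is the escaped one, it needs a decoding step.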

In short, you cannot do what you intend to do by extracting values from the DOM. You’ll need to decode the values into the form you want after parsing the document. Apache Commons Lang has a utility class, StringEscapeUtils, that provides the needed method.
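The decoding step might look like the sketch below. In Commons Lang the relevant call is StringEscapeUtils.unescapeXml(String). To keep the example free of external dependencies, the code uses a hypothetical stand-in named XmlUnescape.unescapeXml, which is not part of any library. It handles only the five predefined entities and does not decode numeric character references such as &#60;.

```java
class XmlUnescape {
    // Hypothetical helper (not a library API): decodes the five predefined XML entities.
    static String unescapeXml(String escaped) {
        return escaped
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&apos;", "'")
                // &amp; goes last so that "&amp;lt;" decodes to "&lt;" rather than "<".
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        // Escaped form of the string from the question.
        String escaped = "&lt;br&gt;test&lt;br&gt;data&lt;br&gt;";
        System.out.println(unescapeXml(escaped)); // prints <br>test<br>data<br>
    }
}
```

If Commons Lang is already on the classpath, calling its unescapeXml is likely the better choice, since a hand-rolled replacement chain like this one is easy to get subtly wrong.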
