I have an XML file with malformed HTML in its content.
Since an XML parser cannot handle HTML tags like <br>, I have used CDATA for saving and parsing.
I have used documentBuilder.setCoalescing(true); while parsing, to recover the data in <![CDATA[<br>test<br>data<br>]]> without the CDATA tag,
but in the output < and > are replaced by &lt; and &gt; respectively.
I am expecting this string in the result:
<br>test<br>data<br>
in the parsed string.
How can I do this? Any ideas?
Thanks in advance!
UPDATE: I have two more questions as a follow-up.
1. Is there any way to turn malformed HTML (e.g. <br>) into parsable XML (e.g. <br/>) via code? If so, will it handle also?
2. Is there any way to convert HTML text to plain text in Java (e.g. <div>test text</div> to test text)?
Coalescing is an operation where the contents of CDATA sections (nodes) are converted to text nodes and merged with the contents of adjacent text nodes. Converting CDATA sections to text nodes imposes the restriction that the resulting text nodes be composed of valid XML characters. This is done to preserve the original document formatting; in other words, the structure of the nodes in the original document will not undergo a change.
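To make the merging concrete, here is a minimal sketch that counts the child nodes of a root element with coalescing off and on. The class name CoalescingDemo, the helper countRootChildren and the sample <root> document are made up for illustration; only the JAXP calls come from the JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original question or answer.
class CoalescingDemo {

    // Parses the XML string and returns how many child nodes the root element has.
    static int countRootChildren(String xml, boolean coalescing) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Note that setCoalescing is configured on the factory, before the builder is created.
        factory.setCoalescing(coalescing);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        NodeList children = doc.getDocumentElement().getChildNodes();
        return children.getLength();
    }

    public static void main(String[] args) throws Exception {
        // Sample input: text, a CDATA section, then more text.
        String xml = "<root>before <![CDATA[<br>test<br>data<br>]]> after</root>";
        // Without coalescing the CDATA section stays a separate node.
        System.out.println("coalescing off: " + countRootChildren(xml, false));
        // With coalescing the CDATA content is merged into the adjacent text.
        System.out.println("coalescing on:  " + countRootChildren(xml, true));
    }
}
```

With coalescing enabled, the text before the CDATA section, its contents and the text after it end up as one text node instead of three separate nodes.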
The resulting behavior is that, of the 5 predefined entities (&lt;, &gt;, &amp;, &quot; and &apos;), the first three are kept in their escaped form, because their unaltered presence would change the document structure.

In short, you cannot do what you intend to do by extracting values from the DOM. You'll need to decode the values into what you want after parsing the document. Apache Commons Lang has a utility class, StringEscapeUtils, that provides the needed method.
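The decoding step might look roughly like the sketch below. It is a hand-rolled stand-in that uses only the JDK, so the class XmlUnescape and the method unescapePredefined are illustrative names and not the Commons Lang API. If Commons Lang is on the classpath, StringEscapeUtils.unescapeXml(String) is presumably the method meant above and would be the better choice.

```java
// Hypothetical helper, written only to illustrate the decoding step.
class XmlUnescape {

    // Decodes the five predefined XML entities back to literal characters.
    // Numeric character references such as &#60; are NOT handled here.
    static String unescapePredefined(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&apos;", "'")
                // &amp; goes last so that "&amp;lt;" becomes "&lt;" and not "<".
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String escaped = "&lt;br&gt;test&lt;br&gt;data&lt;br&gt;";
        // Prints the string the question asks for: <br>test<br>data<br>
        System.out.println(unescapePredefined(escaped));
    }
}
```

Decoding &amp; last matters: doing it first would turn a literal "&amp;lt;" into "&lt;" and then into "<", which is a double decode.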