I’ve got some HTML files that need to be parsed and cleaned, and they occasionally have content with special characters like <, >, “, etc. which have not been properly escaped.
I have tried running the files through jTidy, but the best I can get it to do is omit the content it sees as malformed HTML. Is there a different library that will escape the malformed fragments instead of omitting them? If not, any recommendations on which library would be easiest to modify?
Clarification:
Sample input: <p> blah blah <M+1> blah </p>
Desired output: <p> blah blah &lt;M+1&gt; blah </p>
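To make the desired transformation concrete, here is a minimal sketch in plain Java (no jTidy or TagSoup involved). It assumes a small, hypothetical whitelist of tag names: anything shaped like a tag whose name is on the list is kept as-is, and every other `<` or `>` is escaped. The class name `LenientEscaper`, the method `escapeUnknownTags`, and the contents of `KNOWN_TAGS` are illustrative assumptions, not part of any library. Ampersands are deliberately left alone so that existing entities are not double-escaped.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LenientEscaper {
    // Hypothetical whitelist; extend it to cover the tags your files actually use.
    private static final Set<String> KNOWN_TAGS = Set.of(
            "html", "head", "body", "p", "div", "span", "a", "b", "i",
            "em", "strong", "br", "ul", "ol", "li", "table", "tr", "td", "th");

    // Something shaped like a tag: '<', optional '/', a name,
    // optional attributes, optional self-closing '/', then '>'.
    private static final Pattern TAG_LIKE =
            Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)(?:\\s[^<>]*)?/?>");

    public static String escapeUnknownTags(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = TAG_LIKE.matcher(html);
        int last = 0;
        while (m.find()) {
            String name = m.group(1).toLowerCase(Locale.ROOT);
            if (!KNOWN_TAGS.contains(name)) {
                continue; // unknown name: leave it in the text run so it gets escaped
            }
            // Escape the text before this recognized tag, then copy the tag verbatim.
            out.append(escapeText(html.substring(last, m.start())));
            out.append(m.group());
            last = m.end();
        }
        out.append(escapeText(html.substring(last)));
        return out.toString();
    }

    // Escapes only angle brackets; '&' is left untouched on purpose.
    private static String escapeText(String s) {
        return s.replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        System.out.println(escapeUnknownTags("<p> blah blah <M+1> blah </p>"));
    }
}
```

This is regex-based, so it will not cope with comments, scripts, or `>` characters inside attribute values; it only illustrates the behaviour being asked for.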
You can also try TagSoup. TagSoup emits ordinary SAX events, so in the end you get what looks like a well-formed XML document.
I have had very good luck with TagSoup and I’m always surprised at how well it handles poorly constructed HTML files.