After the parse, I would like to remove dangerous code and write it out again properly formatted.
The purpose is to prevent scripts from entering through an email but still allow the slew of bad HTML to work (at least not fail completely).
Is there a library for that? Is there a better way to keep scripts away from the browser?
The important thing is that the program must not throw a parse exception. It may make best guesses, and even if a guess is wrong, that is acceptable.
Edit: I would appreciate any comments on which parsers y’all think are better and why.
Use one of the available tools that convert HTML to XHTML, for example:
http://www.chilkatsoft.com/java-html.asp
http://java-source.net/open-source/html-parsers
http://htmlcleaner.sourceforge.net/
etc.
Then parse the resulting XHTML with a regular XML parser.
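To make the second step concrete, here is a rough sketch of what "use a regular XML parser" could look like once one of the tools above has already produced well-formed XHTML. It uses only the JDK's built-in DOM parser and serializer. The class name `XhtmlScriptStripper` and the method `strip` are made up for illustration, and the sketch assumes the cleaner emitted lowercase tags and no DOCTYPE. It only drops `script` elements and `on*` event-handler attributes, so treat it as a starting point rather than a complete sanitizer (it does not look at `javascript:` URLs, for instance).

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: expects XHTML that an HTML-to-XHTML cleaner already made well-formed.
final class XhtmlScriptStripper {

    static String strip(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Assumption: the input has no DOCTYPE. Rejecting it avoids fetching
            // external DTDs or expanding entities from untrusted mail.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            // getElementsByTagName returns a live list, so snapshot it before editing the tree.
            NodeList live = doc.getElementsByTagName("*");
            List<Element> elements = new ArrayList<>();
            for (int i = 0; i < live.getLength(); i++) {
                elements.add((Element) live.item(i));
            }

            for (Element el : elements) {
                if (el.getTagName().equalsIgnoreCase("script")) {
                    // Drop the whole element, including its text content.
                    el.getParentNode().removeChild(el);
                } else {
                    removeEventHandlers(el);
                }
            }

            // Write the cleaned tree back out as XML text without an XML declaration.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Reached only if the input was not well-formed after all, or serialization failed.
            throw new IllegalArgumentException("could not sanitize XHTML", e);
        }
    }

    // Removes attributes such as onclick or onload (any name starting with "on").
    private static void removeEventHandlers(Element el) {
        NamedNodeMap attrs = el.getAttributes();
        List<String> toRemove = new ArrayList<>();
        for (int i = 0; i < attrs.getLength(); i++) {
            String name = attrs.item(i).getNodeName();
            if (name.regionMatches(true, 0, "on", 0, 2)) {
                toRemove.add(name);
            }
        }
        for (String name : toRemove) {
            el.removeAttribute(name);
        }
    }
}
```

You would call `XhtmlScriptStripper.strip(cleanedXhtml)` on the cleaner's output. The "never throw on bad HTML" requirement is really the cleaner's job; the XML parser in this step is strict, which is why the conversion to XHTML has to come first.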