I am just looking for a really easy way to clean up some HTML (possibly with embedded JavaScript code). I tried two different HTML Tidy .NET ports and both are throwing exceptions…
Sorry, by “clean” I mean “indent”. The HTML is not malformed at all. It’s XHTML strict.
I finally got something working with SgmlReader, but this is seriously the most ridiculous chunk of code ever just to indent some HTML.
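    // Needs: using System.IO; using System.Xml; using Sgml; (the SgmlReader library)
    private static string FormatHtml(string input)
    {
        // SgmlReader is an XmlReader over HTML, so XmlTextWriter can re-serialize it indented.
        var sgml = new SgmlReader { DocType = "HTML", InputStream = new StringReader(input) };
        using (var sw = new StringWriter())
        using (var xw = new XmlTextWriter(sw) { Indentation = 2, Formatting = Formatting.Indented })
        {
            sgml.Read();
            while (!sgml.EOF)
                xw.WriteNode(sgml, true);

            // Flush and read the result here: sw is out of scope after the using block.
            xw.Flush();
            return sw.ToString();
        }
    }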
The latest C# wrapper for HTML Tidy was done by Mark Beaton, and it seems rather more up to date than the links you’ve referenced (2003). Also worth noting is that Mark provides executables to reference as well, so you don’t have to pull them from the official site. That should do the trick of nicely organising and validating your HTML.
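Since the input is described as strict XHTML and therefore well-formed XML, another option that may be worth trying is to let the framework's own XML classes do the indenting, with no third-party library. The following is only a minimal sketch under that assumption: `XhtmlIndenter` and `Indent` are made-up names for illustration, and the sample input deliberately has no DOCTYPE. Real pages that carry a DOCTYPE or use named entities such as `&nbsp;` would probably need extra DTD handling, and the contents of inline scripts are left as they are rather than reformatted.

```csharp
using System;
using System.Xml.Linq;

public static class XhtmlIndenter
{
    // Parses well-formed XHTML as XML and returns it re-serialized with indentation.
    // Throws System.Xml.XmlException if the input is not well-formed XML.
    public static string Indent(string xhtml)
    {
        // With default options, Parse discards insignificant whitespace
        // and ToString() writes the tree back out indented.
        XDocument doc = XDocument.Parse(xhtml);
        return doc.ToString();
    }

    public static void Main()
    {
        const string input =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>t</title></head>" +
            "<body><p>hi</p></body></html>";

        Console.WriteLine(Indent(input));
    }
}
```

Note that `XDocument.ToString()` does not emit an XML declaration, so if the original file starts with one it would have to be written back separately.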