I’m loading a string containing some HTML into an XmlDocument in order to manipulate it before converting it back into a string.
The following code demonstrates what I’m doing:
// Example of the HTML I am working with
var documentTypeDeclaration = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">";
var html = documentTypeDeclaration + "<html><body><div>£300 ©</div></body></html>";
// Load the HTML into an XmlDocument
var xmlDocument = new XmlDocument();
xmlDocument.XmlResolver = null;
xmlDocument.LoadXml(html);
// Manipulate the HTML...
// Get the HTML back out
var savedHtml = xmlDocument.OuterXml;
Console.WriteLine(html);
Console.WriteLine(savedHtml);
I would expect the two lines written to the console to match, but instead I get this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html><body><div>£300 ©</div></body></html>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"[]><html><body><div>£300 ©</div></body></html>
So it looks like [] has been added to the DOCTYPE declaration, and all the HTML character entities have been converted to their actual characters. This is particularly annoying because the HTML is no longer standards compliant.
Does anyone know how I can stop the XmlDocument class from doing this?
No, but I would use a real HTML parser instead of XmlDocument.
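If staying with XmlDocument is a requirement, the two symptoms in the question can probably be worked around at serialization time. The following is only a sketch under some assumptions: the helper name XhtmlRoundTrip.Serialize is made up for illustration, the document's XmlResolver is assumed to already be null (as in the question) so that recreating the DOCTYPE does not try to fetch the DTD, and it is assumed that numeric character references are an acceptable substitute for the original entities. The "[]" appears to come from the DOCTYPE's internal subset being an empty string rather than null, so the sketch recreates the DOCTYPE with a null subset. It then writes through an XmlWriter with ASCII encoding, which emits characters outside ASCII as numeric character references instead of literal characters.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper, not part of the original question.
public static class XhtmlRoundTrip
{
    public static string Serialize(XmlDocument document)
    {
        // A DOCTYPE parsed from text has an empty (non-null) internal subset,
        // which is written out as "[]". Replacing it with an equivalent DOCTYPE
        // whose internal subset is null avoids the brackets.
        // Assumption: document.XmlResolver is null, so no DTD download is attempted.
        XmlDocumentType docType = document.DocumentType;
        if (docType != null && string.IsNullOrEmpty(docType.InternalSubset))
        {
            XmlDocumentType replacement = document.CreateDocumentType(
                docType.Name, docType.PublicId, docType.SystemId, null);
            document.ReplaceChild(replacement, docType);
        }

        // With an ASCII target encoding, the writer cannot emit characters such
        // as the pound or copyright sign directly, so it falls back to numeric
        // character references. This only applies when writing to a stream.
        var settings = new XmlWriterSettings
        {
            Encoding = Encoding.ASCII,
            OmitXmlDeclaration = true
        };

        using (var stream = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(stream, settings))
            {
                document.Save(writer);
            }
            return Encoding.ASCII.GetString(stream.ToArray());
        }
    }
}
```

In the question's code this would replace the OuterXml call, for example var savedHtml = XhtmlRoundTrip.Serialize(xmlDocument);. Note that it does not restore named entities; it only keeps non-ASCII characters out of the output by writing them as numeric references.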