If I load an HTML page using DOMDocument::loadHTMLFile() and then pass it to simplexml_import_dom(), everything is fine. However, if I use $dom->saveHTML() to get a string representation of the DOMDocument and then pass that to simplexml_load_string(), I get nothing. A very simple page does work, but as soon as the page is any more complex, it fails without any errors in the PHP log file.
Can anyone shed light on this?
Is it something to do with HTML not being parsable XML?
I am trying to strip CRs and newlines out of the formatted HTML text before using its contents. They have nothing to do with the content, but they still end up in the SimpleXMLElement object, which is rather tedious.
Is it something to do with HTML not being parsable XML?
YES! HTML is a far less strict syntax, so simplexml_load_string will not work with it by itself. This is because simplexml is simple and HTML is convoluted. DOMDocument, on the other hand, is designed to read the convoluted HTML structure. Since DOMDocument can make sense of HTML, and simplexml can make sense of a DOMDocument, you can bridge the proverbial gap by passing the parsed document to simplexml_import_dom() instead of re-parsing the saveHTML() string.
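
As a rough sketch of that bridge (the sample markup and variable names here are made up for illustration, not taken from your page), something along these lines should work. It also uses libxml_get_errors() to show why simplexml_load_string() gives up, which may explain why nothing appeared in your log. The trim() call is just one possible way to drop the stray CRs and newlines when reading values.

```php
<?php
// Hypothetical sample markup: acceptable HTML, but not well-formed XML
// (the <br> and <p> tags are never closed).
$html = '<html><body><p>First line<br>Second line<p>Another paragraph</body></html>';

// Collect libxml errors in memory instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// simplexml_load_string() expects well-formed XML, so this returns false.
$direct = simplexml_load_string($html);
foreach (libxml_get_errors() as $error) {
    // Each entry is a LibXMLError describing what the XML parser rejected.
    echo 'XML parse error: ', trim($error->message), PHP_EOL;
}
libxml_clear_errors();

// DOMDocument's HTML parser is tolerant and repairs the markup while loading.
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Hand the already-parsed tree to SimpleXML rather than re-parsing a string.
$xml = simplexml_import_dom($dom);

foreach ($xml->body->p as $p) {
    // trim() removes leading/trailing whitespace such as CRs and newlines.
    echo trim((string) $p), PHP_EOL;
}
```

The point of the design is that the HTML is parsed only once, by the parser that tolerates it; SimpleXML then just wraps the resulting tree.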