I have been using PHP’s DOM to load an HTML template, modify it and output it. Recently I discovered that self-closing (empty) tags in the output don’t include a closing slash, even though the template file did.
e.g.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
</head>
<body>
</body>
</html>
becomes:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
</body>
</html>
Is this a bug or a setting, or a doctype issue?
DOMDocument->saveHTML() takes your XML DOM infoset and writes it out as old-school HTML, not XML. You should not use saveHTML() together with an XHTML doctype, as its output won’t be well-formed XML.
If you use saveXML() instead, you’ll get proper XHTML. It’s fine to serve this XML output to standards-compliant browsers if you give it a Content-Type: application/xhtml+xml header. But unfortunately IE6-8 won’t be able to read that, as they can still only handle old-school HTML, under the text/html media type.
The usual compromise solution is to serve text/html and use ‘HTML-compatible XHTML’ as outlined in Appendix C of the XHTML 1.0 spec. But sadly there is no PHP DOMDocument->saveXHTML() method to generate the correct output for this.
There are some things you can do to persuade saveXML() to produce HTML-compatible output for some common cases. The main one is that you have to ensure that only elements defined by HTML4 as having an EMPTY content model (<img>, <br> etc) actually do have empty content, causing the self-closing syntax (<img/>) to be used. Other elements must not use the self-closing syntax, so if they’re empty you should put a space in their text content to stop them being so. An empty <script> element then comes out as <script> </script> rather than <script/>.
The other one to look out for is handling of the inline <script> and <style> elements, which are normal elements in XHTML but special CDATA-content elements in HTML. Some /*<![CDATA[*/ ... /*]]>*/ wrapping is required to make any < or & characters inside them behave mostly-consistently, though note you still have to avoid the ]]> and </ sequences.
If you want to really do it properly you would have to write your own HTML-compatible-XHTML serialiser. Long-term that would probably be a better option. But for small simple cases, hacking your input so that it doesn’t contain anything that would come out the other end of an XML serialiser as incompatible with HTML is probably the quick solution.
That or just suck it up and live with old-school non-XML HTML, obviously.
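For the space-padding trick described above, something along these lines might do as a starting point. It is only a rough sketch: the helper name padEmptyElements(), the VOID_ELEMENTS constant and the sample markup (including x.js) are made up for illustration and are not part of PHP's DOM API. The sample deliberately has no doctype, so what you see is the plain XML serialisation, and the list of EMPTY elements is the HTML 4.01 one as far as I can tell.

```php
<?php
// Elements that HTML 4.01 defines with an EMPTY content model. Only these
// should be left to serialise with the self-closing syntax.
const VOID_ELEMENTS = [
    'area', 'base', 'basefont', 'br', 'col', 'frame', 'hr',
    'img', 'input', 'isindex', 'link', 'meta', 'param',
];

/**
 * Hypothetical helper: give every childless non-EMPTY element a single
 * space as text content, so saveXML() writes <tag> </tag> instead of <tag/>.
 */
function padEmptyElements(DOMDocument $doc): void
{
    // Copy the node list into an array before touching the tree.
    $elements = iterator_to_array($doc->getElementsByTagName('*'), false);

    foreach ($elements as $element) {
        $isVoid = in_array(strtolower($element->localName), VOID_ELEMENTS, true);
        if (!$isVoid && !$element->hasChildNodes()) {
            $element->appendChild($doc->createTextNode(' '));
        }
    }
}

// Illustrative input: the <script> is empty and self-closed, which is fine
// as XML but would be misread by a text/html parser.
$xhtml = <<<'XML'
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<script type="text/javascript" src="x.js"/>
</head>
<body>
<br/>
</body>
</html>
XML;

$doc = new DOMDocument();
$doc->loadXML($xhtml);
padEmptyElements($doc);

// Passing the root element serialises just that subtree, which leaves out
// the XML declaration.
$output = $doc->saveXML($doc->documentElement);
echo $output, "\n";
```

With this input the <script> element should come out with an explicit closing tag, while <meta> and <br> keep the self-closing form. It does nothing about the CDATA wrapping for inline <script> and <style> content, so that still has to be handled separately.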