Using Perl’s XML::SAX module I’m parsing (X)HTML templates, and as a result am simply echoing a lot of the input to output. I have a SAX event handler that extends XML::SAX::Base and implements the usual methods – start_element, end_element and so on.
Now my question concerns elements that do not take a closing tag – e.g. <img />, <link />, and <input />. The parser will call start_element($element_name, %attribute_hash) and end_element for these tags, but how do I know that the element is self-contained?
In other words, I want to write out <img src="blah" /> unchanged, not as <img ...></img>, which I believe is invalid.
Short of maintaining a list of these elements, what can I do? Is there a way in SAX of directly echoing an element as opposed to reconstructing it from what’s passed to the event handlers?
First, building off Quentin’s comment, you’re using an XML parser to handle HTML. There’s nothing particularly wrong with that as long as the HTML is relatively clean. However, if you need to be in compliance with HTML (as opposed to XHTML), then perhaps an XML parser is the wrong tool.
If you want to hack around it, then here’s what you could do. Implement a characters() callback, which will set a flag if there are any non-whitespace characters present. The start_element() callback will reset this flag. The end_element() callback will consider the tag empty if the flag was not set and write the syntax accordingly.
Note that this will also catch tags like <td></td>, transforming them to <td />.
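A minimal sketch of that idea, for illustration only. It uses a slight variation: instead of a boolean flag it holds back the start tag until the next event shows whether the element has content, so an element that contains only child elements is not mistaken for an empty one. The package name EchoHandler, the helpers _escape and _flush, and the hand-built event hashes are hypothetical; the events mimic the hash references XML::SAX passes to handler callbacks (Name, Attributes, Data). Attributes are written in sorted key order because the attribute hash does not preserve source order.

```perl
use strict;
use warnings;

package EchoHandler;    # hypothetical name

sub new { return bless { out => '', pending => undef, ws => '' }, shift }

# Escape characters that are special in XML text and attribute values.
sub _escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Write a held-back start tag as a normal open tag, followed by any
# whitespace that was buffered while the tag was pending.
sub _flush {
    my ($self) = @_;
    return unless defined $self->{pending};
    $self->{out} .= $self->{pending} . '>' . $self->{ws};
    $self->{pending} = undef;
    $self->{ws}      = '';
}

sub start_element {
    my ($self, $el) = @_;
    $self->_flush;    # a child element means the parent is not empty
    my $tag   = '<' . $el->{Name};
    my $attrs = $el->{Attributes} || {};
    for my $key (sort keys %$attrs) {
        my $attr = $attrs->{$key};
        $tag .= sprintf ' %s="%s"', $attr->{Name}, _escape($attr->{Value});
    }
    $self->{pending} = $tag;    # do not close the tag yet
}

sub characters {
    my ($self, $chars) = @_;
    my $data = $chars->{Data};
    if (defined $self->{pending} && $data !~ /\S/) {
        $self->{ws} .= $data;    # whitespace only: element may still be empty
        return;
    }
    $self->_flush;
    $self->{out} .= _escape($data);
}

sub end_element {
    my ($self, $el) = @_;
    if (defined $self->{pending}) {
        # Nothing but whitespace since the start tag: write the empty form.
        $self->{out} .= $self->{pending} . ' />';
        $self->{pending} = undef;
        $self->{ws}      = '';
    }
    else {
        $self->{out} .= '</' . $el->{Name} . '>';
    }
}

package main;

# Hand-built events in the shape XML::SAX passes to a handler (assumed input).
my $h = EchoHandler->new;
$h->start_element({ Name => 'p', Attributes => {} });
$h->characters({ Data => 'Logo: ' });
$h->start_element({
    Name       => 'img',
    Attributes => { '{}src' => { Name => 'src', Value => 'blah' } },
});
$h->end_element({ Name => 'img' });
$h->end_element({ Name => 'p' });
print $h->{out}, "\n";    # <p>Logo: <img src="blah" /></p>
```

To try it with a real parser, the object should work as the Handler argument, e.g. XML::SAX::ParserFactory->parser(Handler => EchoHandler->new), or the same three methods can be folded into an existing XML::SAX::Base subclass. The caveat above still applies: <td></td> comes out as <td />.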