I have a large number of HTML files that I need to process with XSLT, using an XML file to choose which HTML files to process and what to do with each of them.
I tried:
- Use HTML Tidy to convert HTML -> XHTML / XML
- Use document(filename) in XSLT to read in particular XHTML/XML files
- …use standard nodeset commands to access e.g. “html/body/*”
This doesn’t work, because:
- It seems that XSLT (tried: libXSLT/xsltproc … and Saxon) cannot process XHTML documents as external files (it sees the xhtml DOCTYPE, and refuses to parse it as nodes).
Fine (I thought) … XHTML is just XML, I just need to put it through HTML Tidy and say:
“output-xml yes … output-html no … output-xhtml no”
…but HTML Tidy ignores you if you attempt that, and forces html instead :(. It seems to be hardcoded to only output XML files if the input was XML to begin with.
Any ideas for how to:
- Force HTML Tidy to obey the command-line parameters, and set the doctype I asked for
- Force XSLTproc to parse xhtml DOCTYPEs as xml
- …some other cunning way that will work?
NB: this has to work on OS X – it’s part of a build process for iOS apps. That shouldn’t be a big problem, but it does mean that Windows-only tools aren’t available. I’d like to achieve this with standard open-source cross-platform tools (like tidy, libxslt, etc.)
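I finally discovered why xsltproc / Saxon appeared to refuse to parse the files when they were passed in with an XHTML DOCTYPE: the parsers were reading them fine, but every element in those files lives in the XHTML namespace (the xmlns declaration on the html element), so unqualified names like “html/body/*” matched nothing.
…strangely, if the DOCTYPE was xml, then they happily ignored the xmlns declaration – or at least they allowed me to reference nodes by unqualified name. This fooled me into thinking that they were point-blank ignoring the nodesets inside the xhtml DOCTYPE’d version.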
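For illustration, here is a minimal sketch of the kind of XHTML that HTML Tidy produces. The title and paragraph are made up; the part that matters is the xmlns declaration on the root element, which puts html, head, body and everything beneath them into the XHTML namespace:

```xml
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- The default namespace below applies to every unprefixed element in the file -->
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example page (hypothetical)</title></head>
  <body>
    <p>Hello</p>
  </body>
</html>
```

In XPath 1.0 an unprefixed name such as “body” only matches elements in no namespace, so it will not match the body element above.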
So, the “solution” is something like this:
Example code:
Your stylesheet goes from this:
…to this:
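As a rough sketch of that stylesheet change: the prefix “xhtml” is my own choice here (any prefix works, as long as it is bound to the XHTML namespace URI), and the exclude-result-prefixes attribute is optional and only keeps the declaration out of the output.

```xml
<!-- Before (no XHTML namespace bound):
     <xsl:stylesheet version="1.0"
         xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
-->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="xhtml">

  <!-- templates go here; they can now refer to xhtml:html, xhtml:body, etc. -->

</xsl:stylesheet>
```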
Your select / match / document-import goes from this:
…to this:
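A minimal sketch of the corresponding change to a select, assuming the same “xhtml” prefix as above and a hypothetical input file named page.xhtml. Only the named steps need the prefix; the trailing * matches elements in any namespace.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="xhtml">

  <xsl:template match="/">
    <!-- Before: document('page.xhtml')/html/body/*  (selects nothing
         when the file declares the XHTML namespace) -->
    <!-- After: each named step is qualified with the xhtml prefix -->
    <xsl:copy-of select="document('page.xhtml')/xhtml:html/xhtml:body/*"/>
  </xsl:template>

</xsl:stylesheet>
```

The same applies to match patterns: a template that used match="body" would become match="xhtml:body".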
NB: just to be clear: if you ignore namespaces, then it seems XSLT will work on files that are unDOCTYPED, even if they have a namespace in them. Don’t make the mistake I made of thinking your XSLT is correct just because it appears to work 🙂