We have a bog-standard javax.xml.* parser that slurps up a big XML file and tries to validate it against a custom DTD. The DTD is stored locally, and we’re validating using transformers like in this post from some years back.
All of that works. The trouble we’re seeing now is that the XML format for this type of file is written by the devil. I’m not kidding; the specification is over 750 pages and is signed “Love, Satan.”
Specifically, part of the XML looks like this:
<KnownTag>
<ArbitraryTag> ... text ... </ArbitraryTag>
<Whatever> ... text ... </Whatever>
<fj9e8jer23tj> ... text ... </fj9e8jer23tj>
....
</KnownTag>
The inner tags are balanced — the raw syntax is known to be well-formed XML at this point — but the element names themselves are completely arbitrary and unpredictable. (Yes, it’s that evil. The company that originally published this spec has long since gone out of business because their products were notoriously unreliable. Go figure.)
Our custom DTD can specify <!ELEMENT KnownTag ANY>, but we’re having fits with the content. Obviously the validating parser gives errors as soon as it hits the first user-specified element name (element type “ArbitraryTag” must be declared), and obviously we can’t truly “validate” anything inside that block from a purely parsing context. I’m hoping to find some way of suppressing the errors for just that section of XML.
- The parser’s error handler interface, ErrorHandler, specifies 3 callbacks; its error() is called in this case. If I can figure out from the actual exception passed in that we’re inside a KnownTag block, then I can safely ignore the error and keep going. Is this safe to do with the Java SE implementation?
- Getting to the arbitrary elements afterwards shouldn’t be a problem, since the XML parser itself has already built a DOM Document by this point.
- The APIs for javax.xml.parsers.DocumentBuilder[Factory] and javax.xml.transform.Transformer don’t seem to permit toggling DocumentBuilderFactory#setValidating() midway through the parse. If that’s the case it won’t be surprising, but I’m hoping that I’ve just missed something. Anyone?
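To make the first idea concrete, here is a rough sketch under some loud assumptions. A SAXParseException carries only a message, public/system IDs, and a line/column position, so the element context has to be tracked separately. The sketch therefore uses a SAX parse (not the DocumentBuilder setup described above) with one DefaultHandler acting as both content handler and error handler. The class name SubtreeErrorFilter and its methods are hypothetical. It also assumes the parser reports the "must be declared" error before it delivers startElement() for the offending element, which is not something the SAX API promises.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: keeps validity errors unless they occur inside an open exempt element. */
class SubtreeErrorFilter extends DefaultHandler {
    private final String exemptTag;   // element whose content should not raise validity errors
    private int exemptDepth = 0;      // number of exempt elements currently open
    private final List<SAXParseException> kept = new ArrayList<>();

    SubtreeErrorFilter(String exemptTag) {
        this.exemptTag = exemptTag;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals(exemptTag)) {
            exemptDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals(exemptTag)) {
            exemptDepth--;
        }
    }

    @Override
    public void error(SAXParseException e) {
        // Assumption: the error for an undeclared child arrives while its exempt
        // ancestor is still open, so exemptDepth > 0 means "inside KnownTag".
        if (exemptDepth == 0) {
            kept.add(e);
        }
    }

    List<SAXParseException> keptErrors() {
        return kept;
    }

    /** Parses with DTD validation on and returns the errors that were not suppressed. */
    static List<SAXParseException> validate(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true);
        SAXParser parser = factory.newSAXParser();
        SubtreeErrorFilter handler = new SubtreeErrorFilter("KnownTag");
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.keptErrors();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE Root [<!ELEMENT Root (KnownTag)><!ELEMENT KnownTag ANY>]>"
                + "<Root><KnownTag><ArbitraryTag>text</ArbitraryTag></KnownTag></Root>";
        System.out.println("Errors kept: " + validate(xml).size());
    }
}
```

Fatal errors are left to DefaultHandler, which rethrows them, so a document that is not well-formed still aborts the parse. Anything suppressed this way is simply unvalidated, not valid.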
DTDs have no mechanism for skipping validation on particular subtrees of well-formed XML; that’s one of the differences between DTDs and later schema languages like XSD and RELAX NG, which introduce wildcards to make it possible to say things like “The KnownTag element can contain arbitrary XML” (or: arbitrary elements not in a particular namespace, or in any of a particular set of namespaces, or …).
Whether your parser has a facility to turn error reporting off in a particular subtree is entirely parser-specific; you’ll need to describe just which of the many Java-based XML parsers you are using. The chances are slim; it’s not impossible for a parser to have such a feature, but at first description it doesn’t sound like spec-conformant behavior. (It’s also not a feature I’ve ever heard of a DTD-based validator having, but that doesn’t actually prove much.)
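For comparison, the XSD wildcard mechanism mentioned above might look like the following when driven from Java through javax.xml.validation. This is only a sketch: the schema is invented for illustration (a Root element holding one KnownTag), the class name WildcardSchemaDemo is hypothetical, and it assumes that replacing the DTD with an XSD is an option at all.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

/** Hypothetical demo: KnownTag accepts any well-formed child elements, unvalidated. */
class WildcardSchemaDemo {
    // Illustrative schema only; xs:any with processContents='skip' tells the
    // validator not to look for declarations of the matched elements.
    static final String XSD =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
          + "<xs:element name='Root'><xs:complexType><xs:sequence>"
          + "<xs:element name='KnownTag'><xs:complexType><xs:sequence>"
          + "<xs:any processContents='skip' minOccurs='0' maxOccurs='unbounded'/>"
          + "</xs:sequence></xs:complexType></xs:element>"
          + "</xs:sequence></xs:complexType></xs:element>"
          + "</xs:schema>";

    /** Returns true if the document validates against the schema above. */
    static boolean isValid(String xml) throws Exception {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
        Validator validator = schema.newValidator();
        try {
            // With no ErrorHandler set, validation errors are thrown as SAXException.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Root><KnownTag><ArbitraryTag>text</ArbitraryTag>"
                + "<fj9e8jer23tj>text</fj9e8jer23tj></KnownTag></Root>";
        System.out.println("Valid: " + isValid(xml));
    }
}
```

The rest of the document is still validated normally; only the content matched by the wildcard is skipped.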