I’ve got a fairly large XML document that I’d like to scrape some information out of. It’s too big to hold in memory, so I thought a SAX parser would be appropriate.
Unfortunately, whoever produced the XML doc didn’t read the spec closely enough, so it contains some illegal XML entities (like ). Other than this, though, it’s good as far as I can tell.
In any library that relies on libxml, an error like this disables all further SAX processing unless the parser is run in recovery mode:
    /*
     * [ WFC: Legal Character ]
     * Characters referred to using character references must match the
     * production for Char.
     */
    if (IS_CHAR(val)) {
        return(val);
    } else {
        ctxt->errNo = XML_ERR_INVALID_CHAR;
        if ((ctxt->sax != NULL) && (ctxt->sax->error != NULL))
            ctxt->sax->error(ctxt->userData,
                             "xmlParseCharRef: invalid xmlChar value %d\n",
                             val);
        ctxt->wellFormed = 0;
        if (ctxt->recovery == 0) ctxt->disableSAX = 1;
    }
    return(0);
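For reference, the check that fails here can be approximated in Ruby. This is only a sketch of the Char production from the XML 1.0 spec, not libxml's actual `IS_CHAR` macro; the names `XML_CHAR_RANGES`, `legal_xml_char?` and `illegal_char_refs` are made up for illustration.

```ruby
# Code point ranges allowed by the XML 1.0 Char production.
XML_CHAR_RANGES = [
  0x9..0xA, 0xD..0xD, 0x20..0xD7FF, 0xE000..0xFFFD, 0x10000..0x10FFFF
].freeze

# True if the code point may legally appear in an XML 1.0 document.
def legal_xml_char?(codepoint)
  XML_CHAR_RANGES.any? { |range| range.cover?(codepoint) }
end

# Returns the numeric character references in +text+ (without "&#" and ";")
# whose value is not a legal Char, e.g. "x1E" for "&#x1E;".
def illegal_char_refs(text)
  text.scan(/&#(x[0-9a-fA-F]+|[0-9]+);/).flatten.reject do |ref|
    value = ref.start_with?('x') ? ref[1..-1].to_i(16) : ref.to_i
    legal_xml_char?(value)
  end
end
```

Running `illegal_char_refs` over a sample of the file is a quick way to see which references would trip the parser.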
However, both LibXML::XML::SaxParser and Nokogiri::XML::SAX seem hard-coded to not run in recovery mode, so once I run into an illegal entity, parsing pretty much stops (the former throws an error, and the latter just stops showing element start/ends).
Is there a way I can run one of these (or another SAX parser) in recovery mode?
Ox is another Ruby XML parser, but it doesn't use libxml2 as a backend. It compares pretty well to Nokogiri speed-wise, and it doesn't give a whit about whether XML entities are legit, which makes recovery mode a non-issue.
Adapting the SAX example: