I’m parsing the Wikipedia XML dump using a REXML StreamListener. After a few million articles, it complains that it can’t find a matching close tag and skips the rest of the file.
Is there any way to get it to ignore the unclosed tag, and to resume parsing the stream after it?
The Nokogiri SAX mode is very similar to REXML’s SAX (StreamListener) mode. Sample:
Nokogiri also has a Reader interface which yields every node, in case you don’t like the SAX-style callback interface.
The difference is that Nokogiri can recover from non-well-formed input better than pretty much any parser out there, thanks to the underlying libxml2 recover mode (on by default, I believe).