We’ve been using libxml-ruby for a couple of years. It is fantastic on files of 30 MB or less, but it is PLAGUED by segfaults. Nobody at the project really seems to care about fixing them, only about blaming them on third-party software. That’s their prerogative, of course; it’s free.
Yet I am still unable to read these large files. I suppose I could write some miserable hack to split them into smaller files, but I would like to avoid that. Does anyone else have any experience with reading very large XML files in Ruby?
I’d recommend looking into a SAX XML parser. They’re designed to handle huge files. I haven’t needed one in a while, but they’re pretty easy to use: as the parser reads the XML file, it passes your code various events, which you catch and handle.
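To make the event idea concrete, here is a minimal sketch using REXML's stream parser, chosen only because it ships with Ruby and needs no extra gem. The `TitleListener` class, the `extract_titles` helper, and the `<title>` element are made up for illustration; a real file would need its own element names.

```ruby
require "rexml/parsers/streamparser"
require "rexml/streamlistener"
require "stringio"

# Hypothetical listener: collects the text of every <title> element.
class TitleListener
  include REXML::StreamListener

  attr_reader :titles

  def initialize
    @titles = []
    @in_title = false
    @buffer = +""
  end

  # Fired when an opening tag is read; attrs is a Hash of attribute values.
  def tag_start(name, attrs)
    return unless name == "title"
    @in_title = true
    @buffer = +""
  end

  # Fired for character data; may fire more than once per element,
  # so the pieces are accumulated rather than stored directly.
  def text(data)
    @buffer << data if @in_title
  end

  # Fired when a closing tag is read.
  def tag_end(name)
    return unless name == "title"
    @titles << @buffer.strip
    @in_title = false
  end
end

# Hypothetical helper: runs the listener over any IO-like object.
def extract_titles(io)
  listener = TitleListener.new
  REXML::Parsers::StreamParser.new(io, listener).parse
  listener.titles
end

sample = StringIO.new("<library><book><title>Dune</title></book></library>")
puts extract_titles(sample).inspect
```

Because the parser only hands over one event at a time, nothing here builds a full document tree; passing an open `File` instead of the `StringIO` should work the same way.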
The Nokogiri site has a link to SAX Machine, which is based on Nokogiri, so that would be another option. Either way, Nokogiri is very well supported and used by a lot of people, including me, for all the HTML and XML I parse. It supports both DOM and SAX parsing, allows use of CSS and XPath accessors, and uses libxml2 for its parsing, so it’s fast and based on a standard parsing library.