Does reading XML data like in the following code create the DOM tree in memory?
use XML::Simple;
my $xml = XML::Simple->new;
my $data = $xml->XMLin($blast_output, ForceArray => 1);
For large XML files should I use a SAX parser, with handlers, etc.?
I would say yes to both. XML::Simple builds the entire document as a nested Perl data structure in memory, and that structure takes up a large multiple of the file’s size. For many applications, once your XML is over 100MB or so, it becomes practically impossible to load it entirely into memory in Perl. A SAX parser instead delivers “events” (notifications) as the file is read and tags are opened or closed.
Depending on your usage patterns, either a SAX or a DOM-based parser could be faster. If you are trying to handle just a few nodes, or every node, in a large file, SAX is probably best: for example, reading a large RSS feed and parsing every item in it.
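To make the event-driven style concrete, here is a minimal sketch of a SAX handler that collects the title of every item in an RSS-like feed. It assumes the XML::SAX distribution is installed; the TitleCollector package, the inline sample feed, and the element names are made up for illustration and are not from the question.

```perl
use strict;
use warnings;

{
    package TitleCollector;    # hypothetical handler name
    use base qw(XML::SAX::Base);

    # Fired when an opening tag is read.
    sub start_element {
        my ($self, $el) = @_;
        $self->{in_item} = 1 if $el->{Name} eq 'item';
        if ($self->{in_item} && $el->{Name} eq 'title') {
            $self->{in_title} = 1;
            $self->{buffer}   = '';
        }
    }

    # May fire several times for one text node, so append rather than assign.
    sub characters {
        my ($self, $chars) = @_;
        $self->{buffer} .= $chars->{Data} if $self->{in_title};
    }

    # Fired when a closing tag is read.
    sub end_element {
        my ($self, $el) = @_;
        if ($self->{in_title} && $el->{Name} eq 'title') {
            push @{ $self->{titles} }, $self->{buffer};
            $self->{in_title} = 0;
        }
        $self->{in_item} = 0 if $el->{Name} eq 'item';
    }
}

package main;
use XML::SAX::ParserFactory;

# Tiny made-up feed; a real run would stream a large file instead.
my $feed = '<rss><channel><title>Feed</title>'
         . '<item><title>First</title></item>'
         . '<item><title>Second</title></item>'
         . '</channel></rss>';

my $handler = TitleCollector->new;
my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
$parser->parse_string($feed);    # use parse_uri($path) for a file on disk

my @titles = @{ $handler->{titles} || [] };
print "$_\n" for @titles;
```

Only the handler’s small state (a couple of flags and the current text buffer) lives in memory at any time, which is why this approach scales to files that XML::Simple cannot load.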
On the other hand, if you need to cross-reference one part of the file with another part, a DOM parser or accessing via XPath will make more sense – writing it in the “inside-out” manner that a SAX parser requires will be clumsy and tricky.
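As a rough illustration of the cross-referencing case, the sketch below uses XML::LibXML (my choice of DOM/XPath library, not one named above) to look up, for each hit, the sequence it refers to elsewhere in the document. The document layout and the element and attribute names are invented for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up document: hits refer to sequences defined in another section.
my $xml = <<'XML';
<report>
  <sequences>
    <seq id="s1" name="alpha"/>
    <seq id="s2" name="beta"/>
  </sequences>
  <hits>
    <hit ref="s2" score="97"/>
    <hit ref="s1" score="42"/>
  </hits>
</report>
XML

# The whole tree is loaded into memory, so any part can be queried at any time.
my $doc = XML::LibXML->load_xml(string => $xml);

my @lines;
for my $hit ($doc->findnodes('/report/hits/hit')) {
    my $ref  = $hit->getAttribute('ref');
    # Jump to a different part of the tree with an XPath lookup.
    my $name = $doc->findvalue(qq{/report/sequences/seq[\@id="$ref"]/\@name});
    push @lines, sprintf '%s scored %s', $name, $hit->getAttribute('score');
}

print "$_\n" for @lines;
```

With a SAX handler you would have to remember every sequence yourself before the hits arrive (or make a second pass), which is the “inside-out” bookkeeping described above.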
I recommend trying a SAX parser at least once, because the event-driven thinking required to do so is good exercise.
I’ve had good success with XML::SAX::Machines for setting up SAX parsing in Perl; if you want multiple filters and pipelines, it’s easy to set up. For simpler setups (i.e., 99% of the time) you just need a single SAX filter (look at XML::Filter::Base) and tell XML::SAX::Machines to parse the file (or read from a filehandle) using your filter. Here’s a thorough article.