My aim is to parse 25 GB of XML data. An example of such a data is given below:
<Document>
<Data Id='12' category='1' Body="abc"/>
<Data Id='13' category='1' Body="zwq"/>
.
.
<Data Id='82018030' category='2' CorrespondingCategory1Id='13' Body="pqr"/>
However..considering the data I have of “25 GB”…my approach is quite inefficient. Please suggest some way to improve my code or an alternate approach. Also kindly include a small example code to make things clearer.
You might find that a SAX parser works better for this task. Rather than building a DOM, a SAX parser turns the XML file into a stream of elements and calls functions you provide to let you handle each element.
The good thing is that SAX parsers can be very fast and memory-efficient compared to DOM parsers, and some don’t even need to be given all the XML at once, which would be ideal when you have 25 GB of it.
Unfortunately, if you need any context information, like “I want tag
<B>but only if it’s inside tag<A>,” you must maintain it yourself, since all the parser gives you is “start tag<A>, start tag<B>, end tag<B>, end tag<A>.” It never explicitly tells you that tag<B>is inside tag<A>, you have to figure that out from what you saw. And once you have seen an element, it’s gone unless you remembered it yourself.This gets very hairy for complex parsing jobs, but yours is probably manageable.
It happens that Python’s standard library has a SAX parser in
xml.sax. You probably want something likexml.sax.xmlreader.IncrementalParser.