My aim is to parse 25 GB of XML data. An example of such


Asked: June 1, 2026


My aim is to parse 25 GB of XML data. An example of the data is given below:

<Document>
<Data Id='12' category='1'  Body="abc"/>
<Data Id='13' category='1'  Body="zwq"/>
.
.
<Data Id='82018030' category='2' CorrespondingCategory1Id='13' Body="pqr"/>

However, given that I have 25 GB of data, my approach is quite inefficient. Please suggest a way to improve my code, or an alternative approach. Kindly also include a small code example to make things clearer.


1 Answer

Editorial Team · Answer 1 · 2026-06-01T13:05:38+00:00

You might find that a SAX parser works better for this task. Rather than building a DOM, a SAX parser turns the XML file into a stream of elements and calls functions you provide to let you handle each element.
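As a rough illustration of the callback style, here is a minimal sketch using the standard library's xml.sax. The question does not say what should be done with each record, so counting <Data> elements per category is only a stand-in task, and the class name CategoryCounter is made up for this example.

```python
import xml.sax
from collections import Counter


class CategoryCounter(xml.sax.ContentHandler):
    """Counts <Data> elements per category without keeping them in memory."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def startElement(self, name, attrs):
        # Called once for every opening tag, in document order.
        if name == "Data":
            self.counts[attrs.get("category")] += 1


# A tiny stand-in for the real file, modelled on the sample in the question.
sample = b"""<Document>
<Data Id='12' category='1' Body="abc"/>
<Data Id='13' category='1' Body="zwq"/>
<Data Id='82018030' category='2' CorrespondingCategory1Id='13' Body="pqr"/>
</Document>"""

handler = CategoryCounter()
xml.sax.parseString(sample, handler)
print(handler.counts)
```

For a file on disk, xml.sax.parse(path, handler) takes a file name or an open file object in place of the byte string, and the handler stays the same.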

The good thing is that SAX parsers can be very fast and memory-efficient compared to DOM parsers, and some don’t even need to be given all the XML at once, which would be ideal when you have 25 GB of it.

Unfortunately, if you need any context information, like “I want tag <B> but only if it’s inside tag <A>,” you must maintain it yourself, since all the parser gives you is “start tag <A>, start tag <B>, end tag <B>, end tag <A>.” It never explicitly tells you that tag <B> is inside tag <A>; you have to work that out from what you have already seen. And once you have seen an element, it’s gone unless you stored it yourself.

This gets very hairy for complex parsing jobs, but yours is probably manageable.
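One common way to keep that context is a stack of the element names that are currently open. The sketch below handles the "<B> only if inside <A>" case from above. The tag names A, B and C and the class name BInsideAHandler are purely illustrative and are not taken from the question's data.

```python
import xml.sax


class BInsideAHandler(xml.sax.ContentHandler):
    """Counts <B> elements that have an <A> somewhere among their ancestors."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of element names that are currently open
        self.matches = 0

    def startElement(self, name, attrs):
        # The parser only reports "start B"; the stack tells us where we are.
        if name == "B" and "A" in self.open_tags:
            self.matches += 1
        self.open_tags.append(name)

    def endElement(self, name):
        # Well-formed XML closes tags in reverse order, so a plain pop is enough.
        self.open_tags.pop()


doc = b"<root><A><B/></A><B/><A><C><B/></C></A></root>"
ctx_handler = BInsideAHandler()
xml.sax.parseString(doc, ctx_handler)
print(ctx_handler.matches)
```

To require <A> as the direct parent rather than any ancestor, the membership test could be replaced by a check on the last entry of the stack.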

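Python’s standard library includes a SAX parser in xml.sax. You probably want something like xml.sax.xmlreader.IncrementalParser, an interface whose feed() method accepts the XML in chunks; the parser returned by xml.sax.make_parser() implements it.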
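A minimal sketch of feeding the parser in chunks, assuming the default parser from xml.sax.make_parser() (which implements the IncrementalParser interface). The helper parse_in_chunks, the class IdCollector and the chunk size are made up for this example, and an in-memory stream stands in for the real 25 GB file.

```python
import io
import xml.sax


class IdCollector(xml.sax.ContentHandler):
    """Records the Id attribute of every <Data> element (demo only)."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def startElement(self, name, attrs):
        if name == "Data":
            self.ids.append(attrs.get("Id"))


def parse_in_chunks(stream, handler, chunk_size=64 * 1024):
    """Reads a binary stream piece by piece and feeds each piece to the parser."""
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)  # handler callbacks fire as soon as tags are complete
    parser.close()  # signals end of input and reports any unfinished document


stream = io.BytesIO(
    b"<Document>"
    b"<Data Id='12' category='1' Body='abc'/>"
    b"<Data Id='13' category='1' Body='zwq'/>"
    b"</Document>"
)
collector = IdCollector()
# A deliberately tiny chunk size, so tags are split across chunk boundaries.
parse_in_chunks(stream, collector, chunk_size=16)
print(collector.ids)
```

With a real file, the stream would be the result of open(path, "rb"). Collecting every Id in a list is only for demonstration; for 25 GB of input, the handler would more likely write results out or aggregate them as it goes.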
