I need to write a parser in Python that can process some extremely large files (> 2 GB) on a computer without much memory (only 2 GB). I want to use iterparse in lxml to do it.
My file is of the format:
<item>
<title>Item 1</title>
<desc>Description 1</desc>
</item>
<item>
<title>Item 2</title>
<desc>Description 2</desc>
</item>
and so far my solution is:
from lxml import etree

context = etree.iterparse(MYFILE, tag='item')
for event, elem in context:
    print(elem.xpath('desc/text()'))
del context
Unfortunately, this solution still eats up a lot of memory. I think the problem is that after dealing with each "item" I need to do something to clean up empty children. Can anyone suggest what I should do after processing each element to clean up properly?
Try Liza Daly’s fast_iter. After processing an element, elem, it calls elem.clear() to remove descendants and also removes preceding siblings.
Daly’s article is an excellent read, especially if you are processing large XML files.
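Here is a minimal sketch of such a fast_iter, modeled on the pattern described above rather than copied from Daly's article. The `<root>` wrapper, the in-memory sample and the `collect_desc` callback are assumptions made for illustration; with a real file you would pass its path to `etree.iterparse` instead of a `BytesIO` object.

```python
import io
from lxml import etree

def fast_iter(context, func, *args, **kwargs):
    """Call func on each element yielded by an iterparse context, then free it."""
    for event, elem in context:
        func(elem, *args, **kwargs)
        # Drop the element's text, attributes and children.
        elem.clear()
        # Remove already-processed siblings of the element and of its ancestors.
        for ancestor in elem.xpath('ancestor-or-self::*'):
            parent = ancestor.getparent()
            while parent is not None and ancestor.getprevious() is not None:
                del parent[0]
    del context

# Hypothetical sample data; the <root> wrapper is assumed so the XML is well formed.
SAMPLE = b"""<root>
<item><title>Item 1</title><desc>Description 1</desc></item>
<item><title>Item 2</title><desc>Description 2</desc></item>
</root>"""

descriptions = []

def collect_desc(elem):
    # findtext returns the text of the first matching child, or None.
    descriptions.append(elem.findtext('desc'))

context = etree.iterparse(io.BytesIO(SAMPLE), events=('end',), tag='item')
fast_iter(context, collect_desc)
print(descriptions)
```

The callback has to extract everything it needs before it returns, because the element is emptied right afterwards.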
Edit: The fast_iter posted above is a modified version of Daly’s fast_iter. After processing an element, it is more aggressive at removing other elements that are no longer needed.
The script below shows the difference in behavior. Note in particular that orig_fast_iter does not delete the A1 element, while mod_fast_iter does delete it, thus saving more memory.
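A compact sketch of such a comparison follows. The sample document, the `remaining_tags` helper and the no-op callback are assumptions for illustration, not the original script. Both variants are run over the same small tree, and the tags still attached to the root afterwards are listed.

```python
import io
from lxml import etree

# Hypothetical document: the <C> elements of interest are nested inside <A1>/<A2>.
XML = b"<root><A1><B1/><C>1</C><E1/></A1><A2><B2/><C>2</C><E2/></A2></root>"

def orig_fast_iter(context, func):
    # Clears the element and removes only its own preceding siblings.
    for event, elem in context:
        func(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]

def mod_fast_iter(context, func):
    # Also removes preceding siblings of every ancestor of the element.
    for event, elem in context:
        func(elem)
        elem.clear()
        for ancestor in elem.xpath('ancestor-or-self::*'):
            parent = ancestor.getparent()
            while parent is not None and ancestor.getprevious() is not None:
                del parent[0]

def remaining_tags(fast_iter):
    """Run fast_iter over XML and list the tags still attached to the root."""
    context = etree.iterparse(io.BytesIO(XML), events=('end',), tag='C')
    fast_iter(context, lambda elem: None)
    # context.root is available once parsing has finished.
    return [el.tag for el in context.root.iter()]

orig_left = remaining_tags(orig_fast_iter)
mod_left = remaining_tags(mod_fast_iter)
print(orig_left)
print(mod_left)
```

With orig_fast_iter the A1 subtree is still in the tree when the second C is reached, because only siblings of C itself are removed. mod_fast_iter walks up through the ancestors and drops A1 as soon as A2's C has been processed.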