I am using Python 2.7 with the latest lxml library. I am parsing a large XML file with a very homogeneous structure and millions of elements. I thought lxml’s iterparse would not build an internal tree while it parses, but apparently it does, since memory usage grows until the process crashes (at around 1 GB). Is there a way to parse a large XML file with lxml without using a lot of memory?
I saw the target parser interface as one possibility, but I’m not sure if that will work any better.
Try using Liza Daly’s fast_iter:
fast_iter removes elements from the tree after they have been parsed, and also previous elements (maybe with other tags) that are no longer needed. It could be used like this:
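Here is a minimal sketch of the function together with a usage example. It follows the pattern as it is commonly circulated, so it may not match Liza Daly's original code line for line. The sample document, the `item` tag and the `collect_id` callback are made up for illustration; swap in your own file and processing function.

```python
import io

from lxml import etree


def fast_iter(context, func, *args, **kwargs):
    """Call func on every element yielded by iterparse, then free it."""
    for event, elem in context:
        func(elem, *args, **kwargs)
        # Drop the element's own children, text and attributes.
        elem.clear()
        # Delete already-processed preceding siblings of the element and
        # of each of its ancestors, so the partial tree stops growing.
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context


def collect_id(elem, seen):
    # Hypothetical callback: do the real per-element work here,
    # before fast_iter clears the element.
    seen.append(elem.get('id'))


# Small in-memory stand-in for the large file (assumed structure).
xml_data = (
    "<root>"
    + "".join('<item id="%d"><name>n%d</name></item>' % (i, i) for i in range(5))
    + "</root>"
).encode("utf-8")

# Only 'end' events for <item> elements are delivered to the callback.
context = etree.iterparse(io.BytesIO(xml_data), events=('end',), tag='item')
seen = []
fast_iter(context, collect_id, seen)
```

For a real file, pass the filename to `etree.iterparse` instead of the `BytesIO` object. The callback has to extract everything it needs from the element, because the element is emptied as soon as the callback returns.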