I am using lxml iterparse to read huge xml files. For a given mainElement,

Question

Asked: May 26, 2026 (2026-05-26T21:10:40+00:00)

I am using lxml iterparse to read huge XML files. For a given mainElement, I check the child elements and process each child. But I notice that, when I check the children within an element, the parser sometimes misses some child nodes. I even printed the length of each element, which should be a constant number for a given element tag, but it is sometimes smaller than it should be. Surprisingly, this usually happens during the 5th block (one block = one occurrence of mainElement). Is there a reason why the parser should miss the child nodes? Any clues?

Sample code-

from lxml import etree

def parseXml(context, attribList, elemList, mainElement):
    for event, element in context:
        if element.tag == mainElement and event == 'start':
            for child in element:
                if child.tag in elemList:
                    print(len(child))  # for a given child, the len should be constant
                    # do things
        elif event == 'end':
            element.clear()

Thanks!


1 Answer

Editorial Team · Answer 1 · 2026-05-26T21:10:41+00:00

When you define the context, be sure to set the parameter events to ('end',) rather than ('start',). Otherwise, you can get the very behavior you are describing.

context=etree.iterparse(filehandle, events=('end',), tag=mainElement)

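I think the problem is that a 'start' event fires as soon as lxml has read the element's opening tag, before the children are guaranteed to have been parsed. iterparse reads the file in chunks, so by the time parseXml sees a start event, some children may already have been parsed and others not, depending on where the chunk boundary happens to fall. So when you loop through the element's children, you only get a partial result. That would also explain why it tends to show up at roughly the same block each time. By the 'end' event the closing tag has been read, so the element and all of its children are complete.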
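To see the difference between the two events, here is a minimal sketch rather than a definitive test. The document layout (a root holding item elements with five field children each) and the helper names build_sample and child_counts are made up for illustration. It filters on elem.tag by hand instead of using the tag= keyword so that it can fall back to the standard library's xml.etree.ElementTree, which follows the same event model, when lxml is not installed.

```python
import io

try:
    from lxml import etree  # the library used in the question
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, same event model


def build_sample(blocks=200, children=5):
    """Build a hypothetical document: <root> with `blocks` <item> elements,
    each holding `children` <field> children."""
    parts = ["<root>"]
    for _ in range(blocks):
        parts.append("<item>")
        parts.extend("<field>%d</field>" % j for j in range(children))
        parts.append("</item>")
    parts.append("</root>")
    return "".join(parts).encode("utf-8")


def child_counts(data, event_name, tag="item"):
    """Return len(elem) for every `tag` element seen on the given event."""
    counts = []
    for _, elem in etree.iterparse(io.BytesIO(data), events=(event_name,)):
        if elem.tag == tag:
            counts.append(len(elem))
    return counts


data = build_sample()

# On 'end' the closing tag has been read, so the element is complete.
end_counts = child_counts(data, "end")
print(set(end_counts))  # every <item> reports all 5 children

# On 'start' only the opening tag is guaranteed; how many children are
# already attached depends on how far the parser has read ahead.
start_counts = child_counts(data, "start")
print(min(start_counts), max(start_counts))
```

Whether the start counts actually come up short depends on the file size and on where the parser's read buffer ends, so a small sample may look fine while a huge file does not. Reading children on a start event is done here only to show the effect; real processing belongs in the end event.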

By the way, this article gives a nice way to organize this, designed for processing very large XML:

from lxml import etree

def fast_iter(context, func, *args, **kwargs):
    # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    # Author: Liza Daly
    for event, elem in context:
        func(elem, *args, **kwargs)
        # Free the processed element, then drop the already-handled
        # siblings that the parent still references.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def parseXml(element, attribList, elemList):
    # Called on 'end' events only, so every child is already parsed.
    for child in element:
        if child.tag in elemList:
            print(len(child))  # for a given child, the len should be constant
            # do things

context = etree.iterparse(filehandle, events=('end',), tag=mainElement)
fast_iter(context, parseXml, attribList, elemList)
