I am using lxml's iterparse to read huge XML files. For a given mainElement, I loop over its child elements and process each one. However, when I check the children of an element, the parser sometimes seems to miss some of the child nodes. I even printed the length of each child element, which should be constant for a given tag, but it is sometimes smaller than it should be. Surprisingly, this usually happens around the 5th block (one block = one occurrence of mainElement). Is there a reason why the parser would miss child nodes? Any clues?
Sample code:

    from lxml import etree

    def parseXml(context, attribList, elemList, mainElement):
        for event, element in context:
            if element.tag == mainElement and event == 'start':
                for child in element:
                    if child.tag in elemList:
                        print(len(child))  # for a given child, the len should be constant
                        # do things
            elif event == 'end':
                element.clear()

Thanks!
When you define the context, be sure to set the events parameter to ('end',) rather than ('start',). Otherwise, you can get exactly the behavior you are describing.

I think the problem is that lxml emits the start event as soon as it has read an element's opening tag, while the rest of the input is still being read and parsed incrementally. So you can reach a start event in parseXml before lxml has parsed through to the corresponding end tag, and when you loop through the element's children you only get a partial result. Only the end event guarantees that the element and all of its children have been parsed.

By the way, this article gives a nice way to organize this, designed for processing very large XML:
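
To make the events=('end',) advice concrete, here is a minimal sketch (not taken from the article mentioned above). The function name count_children, its parameters, and the sample XML are made up for illustration. The fallback import is only there so the sketch also runs where lxml is not installed; it relies on nothing beyond iterparse(source, events=...) and Element.clear().

```python
import io

try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback, used only if lxml is missing
    import xml.etree.ElementTree as etree


def count_children(source, main_element, elem_list):
    """Return len(child) for each wanted child of every main_element block."""
    lengths = []
    # Ask only for 'end' events: an element is complete once its closing
    # tag has been parsed, so all of its children are available here.
    for event, element in etree.iterparse(source, events=('end',)):
        if element.tag == main_element:
            for child in element:
                if child.tag in elem_list:
                    lengths.append(len(child))
                    # do things
            # Release the finished block to keep memory use down.
            element.clear()
    return lengths


# Hypothetical sample data: 50 identical blocks, each 'item' has 3 children.
xml = (b"<root>"
       + b"<block><item><a/><b/><c/></item><other/></block>" * 50
       + b"</root>")
lengths = count_children(io.BytesIO(xml), 'block', ['item'])
print(lengths == [3] * 50)
```

Note that clear() is called only on the main element here. With end events, each child finishes before its parent does, so clearing every element as it ends (as the question's code does) would empty the children before the parent block is inspected.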