I’m trying to use the pattern described in the “event-driven parsing” section of the lxml tutorial.
In my code I’m calling a function that can recursively run on elements using the iterchildren() method. I’ll just use two nested loops for illustration here.
This works as expected:

    from io import BytesIO
    from lxml import etree

    xml = BytesIO(b"<root><a><b>data</b><c><d/></c></a><a><z/></a></root>")
    for ev, elem in etree.iterparse(xml):
        if elem.tag == 'a':
            for c in elem.iterchildren():
                for gc in c.iterchildren():
                    print(gc)

The output is <Element d at 0x2df49b0>.
But if I add .clear() at the end of the loop body:

    xml.seek(0)  # rewind the stream before parsing it again
    for ev, elem in etree.iterparse(xml):
        if elem.tag == 'a':
            for c in elem.iterchildren():
                for gc in c.iterchildren():
                    print(gc)
        elem.clear()

With this change, nothing is printed. Why is that, and how can I work around it?
Notes:
- I can skip `iterchildren` and do `for c in elem` or `for c in list(elem)`, with the same effect.
- I need to use an iterative approach to keep the memory usage low.
- In the real use case, I am doing an element lookup using an attribute:

      if elem.attrib.get('id') == elem_id:
          return _get_info(elem)

I would like an explanation of how `clear()` manages to erase the inner elements before they are processed, and how to keep them in memory while they are still needed for processing their ancestors.
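The problem is that `iterparse` by default yields only `end` events. For subelements, `end` events are generated earlier than for their ancestors, so by the time the loop reaches the `end` event for `a`, its children have already been yielded and cleared.

In this simple case the problem can be solved by relying on the `start` events instead.

However, the linked docs say that only the `end` event guarantees that an element has been parsed completely; its text, tail, and children are not necessarily present yet when the `start` event arrives.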
That probably means that both `start` and `end` events should be processed to allow calling `clear()` only when it is safe. I’m thinking of implementing some kind of stack, pushing things onto it on `start` events and popping and `clear`ing on `end` events.

I would very much welcome answers that show implementations using this or other ideas.
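
To make the event order and the stack idea concrete, here is a minimal sketch. It is one possible reading of the idea and has not been tuned. The names `event_order`, `grandchild_tags`, and `XML` are made up for illustration, and the fallback to the standard-library `xml.etree.ElementTree` is an assumption added so the snippet also runs where lxml is not installed (both offer `iterparse()` and `clear()`).

```python
from io import BytesIO

try:
    from lxml import etree
except ImportError:
    # Assumption: the standard-library ElementTree is an acceptable stand-in,
    # since it offers the same iterparse()/clear() calls used below.
    import xml.etree.ElementTree as etree

XML = b"<root><a><b>data</b><c><d/></c></a><a><z/></a></root>"


def event_order(source):
    """Return (event, tag) pairs in the order iterparse() reports them."""
    events = ("start", "end")
    return [(ev, elem.tag) for ev, elem in etree.iterparse(source, events=events)]


def grandchild_tags(source, tag="a"):
    """Yield the grandchild tags of every <tag> element, clearing as we go."""
    open_tags = []  # stack of tags for the elements that are currently open
    for ev, elem in etree.iterparse(source, events=("start", "end")):
        if ev == "start":
            open_tags.append(elem.tag)
            continue
        open_tags.pop()  # this element is now complete
        if elem.tag == tag:
            yield [gc.tag for c in elem for gc in c]
        # Clear only when no enclosing <tag> element still needs this subtree.
        if tag not in open_tags:
            elem.clear()
```

Calling `event_order(BytesIO(XML))` shows that `('end', 'b')` and `('end', 'c')` are reported before `('end', 'a')`, which is why an unconditional `clear()` empties the children first. `list(grandchild_tags(BytesIO(XML)))` keeps the subtree of each `a` alive until that `a` has been processed, and clears everything else as soon as it ends.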
Note: the easiest thing to do is just to indent the `clear` call into the `if`. That would allow accessing grandchildren, but all unrelated elements would remain uncleared.

EDIT

I am currently using a state-keeping variable in the following manner:
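
A state-keeping variable of this kind might look roughly like the sketch below. It is an illustration under assumptions rather than the original code: `find_info` is a made-up name, `_get_info` is a hypothetical stand-in for the real helper, and the fallback to the standard-library `xml.etree.ElementTree` is only there so the snippet runs without lxml. The single variable `depth` records whether the parser is currently inside the wanted element, and `clear()` is called only while it is not.

```python
from io import BytesIO

try:
    from lxml import etree
except ImportError:
    # Assumption: the standard-library ElementTree is an acceptable stand-in.
    import xml.etree.ElementTree as etree


def _get_info(elem):
    """Hypothetical helper: report the tag and the grandchild tags."""
    return {"tag": elem.tag, "grandchildren": [gc.tag for c in elem for gc in c]}


def find_info(source, elem_id):
    """Return _get_info() for the element whose id attribute equals elem_id."""
    depth = 0  # state: 0 outside the wanted element, >0 while inside it
    for ev, elem in etree.iterparse(source, events=("start", "end")):
        if ev == "start":
            # Start counting at the wanted element and keep counting below it.
            if depth or elem.get("id") == elem_id:
                depth += 1
        elif depth:
            depth -= 1
            if depth == 0:
                # The wanted element is complete and its subtree is intact.
                return _get_info(elem)
        else:
            # Not inside the wanted element, so this subtree can be dropped.
            elem.clear()
    return None
```

For example, with `doc = b'<root><a id="1"><b><c/></b></a><a id="2"><x><y/><z/></x></a></root>'`, the call `find_info(BytesIO(doc), "2")` clears the first `a` while skipping over it and returns the info for the second one with its grandchildren still in place.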