I have an XML document like this:
<a>
    <b>hello</b>
    <b>world</b>
</a>
<x>
    <y></y>
</x>
<a>
    <b>first</b>
    <b>second</b>
    <b>third</b>
</a>
I need to iterate through all <a> and <b> tags, but I don't know how many of them are in the document, so I use XPath to handle that:
from lxml import etree

doc = etree.fromstring(xml)
# Find every <a> element anywhere in the document
atags = doc.xpath('//a')
for a in atags:
    # Find the <b> children of this particular <a>
    btags = a.xpath('b')
    for b in btags:
        print(b)
It works, but I have pretty big files, and cProfile shows me that the XPath calls are very expensive.
Is there a more efficient way to iterate through an unknown number of XML elements?
XPath should be fast. You can reduce the number of XPath calls to one by selecting the <b> elements directly with the expression '//a/b'.
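A minimal sketch of the single-call approach, assuming the sample data from the question is wrapped in a hypothetical <root> element so that it parses as one document (the name b_texts is also only illustrative):

```python
from lxml import etree

# Sample input from the question, wrapped in an assumed <root> element
# because etree.fromstring needs a single top-level element.
xml = b"""<root>
<a><b>hello</b><b>world</b></a>
<x><y></y></x>
<a><b>first</b><b>second</b><b>third</b></a>
</root>"""

doc = etree.fromstring(xml)

# One XPath call selects every <b> that is a direct child of any <a>,
# replacing the outer '//a' query plus one 'b' query per <a>.
b_texts = [b.text for b in doc.xpath('//a/b')]

for text in b_texts:
    print(text)
```

The elements come back in document order, so the output order matches the nested loops in the question.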
If that is not fast enough, you could try Liza Daly's fast_iter. This has the advantage of not requiring that the entire XML be processed with etree.fromstring first, and parent nodes are thrown away after the children have been visited. Both of these things help reduce the memory requirements. A modified version of fast_iter can be more aggressive about removing other elements that are no longer needed.

Liza Daly's article on parsing large XML files may prove useful reading to you too. According to the article, lxml with fast_iter can be faster than cElementTree's iterparse. (See Table 1).
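The fast_iter idea described above could be sketched roughly as follows. This is an illustration in the spirit of that function, not necessarily its exact code. The <root> wrapper, the process_element callback and the seen list are assumptions made for the example.

```python
import io
from lxml import etree

def fast_iter(context, func, *args, **kwargs):
    """Call func on each element from iterparse, freeing memory as we go."""
    for _event, elem in context:
        func(elem, *args, **kwargs)
        # func is done with elem, so its text and children can be dropped.
        elem.clear()
        # Remove already-processed earlier siblings of elem and of each of
        # its ancestors, so the partially built tree does not keep growing.
        for ancestor in elem.xpath('ancestor-or-self::*'):
            parent = ancestor.getparent()
            while parent is not None and ancestor.getprevious() is not None:
                del parent[0]

# Sample input from the question, wrapped in an assumed <root> element.
xml = b"""<root>
<a><b>hello</b><b>world</b></a>
<x><y></y></x>
<a><b>first</b><b>second</b><b>third</b></a>
</root>"""

seen = []

def process_element(elem):
    # Hypothetical callback: only handle <b> elements whose parent is <a>.
    if elem.getparent().tag == 'a':
        seen.append(elem.text)
        print(elem.text)

# iterparse reads the input incrementally; tag='b' yields only <b> elements.
context = etree.iterparse(io.BytesIO(xml), events=('end',), tag='b')
fast_iter(context, process_element)
```

For a real file, the io.BytesIO(xml) argument would be replaced by a filename or an open file object, so the document never has to be loaded in full.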