Let’s say I have the following HTML:
<div>
text1
<div>
t1
</div>
text2
<div>
t2
</div>
text3
</div>
I know how to get the text and sub-elements of the enclosing div using lxml.html. But is there a way to access both text and sub-elements iteratively, in a way that preserves order? In other words, I want to know where the “free text” of the div appears relative to the sub-elements. I would like to be able to know that “text1” appears before the first inner div, that “text2” appears between the two inner divs, etc.
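The `elementtree` interface, which `lxml` also offers, supports that, e.g. with the built-in element tree in Python 2.7: iterate over the elements in document order and read each one's `.text` (the text before its first child) and `.tail` (the text that follows its closing tag).

Depending on your version of lxml/elementtree, you may need to spell the iterator method `.getiterator()` instead of `.iter()`.

If you need a single generator that yields tags and texts in order, you can wrap that iteration in one.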
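The code that originally accompanied this answer is not preserved here, so the following is only a minimal sketch of the two steps described above. It assumes Python 3 and the standard-library `xml.etree.ElementTree` (lxml's `etree` exposes the same `.iter()`, `.text` and `.tail`); the names `SAMPLE`, `elements_and_texts` and `events` are illustrative.

```python
from xml.etree import ElementTree as ET

# The markup from the question; it is well-formed, so an XML parser accepts it.
SAMPLE = """<div>
text1
<div>
t1
</div>
text2
<div>
t2
</div>
text3
</div>"""

root = ET.fromstring(SAMPLE)

# Step 1: plain iteration in document order, showing .text and .tail.
for el in root.iter():
    print("%s: %r, %r" % (el.tag, el.text, el.tail))


# Step 2: one generator that yields tags and texts, skipping missing ones.
def elements_and_texts(tree):
    for el in tree.iter():
        yield "tag", el.tag
        if el.text is not None:
            yield "text", el.text
        if el.tail is not None:
            yield "tail", el.tail


events = list(elements_and_texts(root))
for kind, value in events:
    print(kind, repr(value))
```

One caveat of this simple form: an element's `'tail'` entry is emitted right after its own `'text'`, so with deeper nesting it would appear before that element's descendants. For the flat markup in the question the order matches the document.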
This basically removes the `None`s and yields two-tuples with a first item of `'tag'`, `'text'`, or `'tail'`, to help you distinguish. I imagine this is not your ideal format, but it should not be hard to mold it into something more to your liking ;-).
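As one possible way of molding it, here is a hedged sketch that looks only at a single parent and yields its free text and its direct children in document order, which is the layout the question asks about. The helper name `ordered_children` and the decision to strip whitespace are assumptions for illustration, not part of either library.

```python
from xml.etree import ElementTree as ET

SAMPLE = """<div>
text1
<div>
t1
</div>
text2
<div>
t2
</div>
text3
</div>"""


def ordered_children(parent):
    """Yield ('text', str) and ('element', Element) pairs in document order."""
    # Text before the first child lives on the parent itself.
    text = (parent.text or "").strip()
    if text:
        yield "text", text
    for child in parent:
        yield "element", child
        # Text after a child's closing tag lives on that child's .tail.
        tail = (child.tail or "").strip()
        if tail:
            yield "text", tail


root = ET.fromstring(SAMPLE)
layout = [value if kind == "text" else "<%s>" % value.tag
          for kind, value in ordered_children(root)]
print(layout)  # ['text1', '<div>', 'text2', '<div>', 'text3']
```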