I have the following XML document:
<x>
<a>Some text</a>
<b>Some text 2</b>
<c>Some text 3</c>
</x>
I want to get the text of all the tags, so I decided to use getiterator().
My problem is that it prints extra blank lines for a reason I can’t understand. Consider this:
>>> for text in document_root.getiterator():
...     print text.text
...


Some text
Some text 2
Some text 3
Notice the two extra blank lines before ‘Some text’. What is the reason for this? If I pass a tag to the getiterator() method, there are no blank lines, as it should be.
>>> for text in document_root.getiterator('a'):
...     print text.text
...
Some text
So my question is: what causes those extra blank lines when I call getiterator() without a tag, and how do I remove them?
By default, lxml.etree treats the whitespace between tags as text content of the enclosing element. In your case, the blank lines come from the text of <x>, which is the newline between <x> and <a>. If you want a parser that ignores this whitespace, create one with etree.XMLParser(remove_blank_text=True) and pass it to the parsing function.
Note that node.text is None when an element has no text at all, so with such a parser the root prints as None rather than as blank lines. Also note that the API documentation for lxml states that getiterator() is deprecated in favor of iter().
For more information, see The lxml.etree Tutorial: Parser objects.
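A minimal sketch of the parser setup described above, assuming lxml is installed and using Python 3 syntax. The names XML_SOURCE, default_root, and clean_root are illustrative and do not come from the original post:

```python
from lxml import etree

# The document from the question, with the closing tag of <a> corrected.
XML_SOURCE = b"""<x>
<a>Some text</a>
<b>Some text 2</b>
<c>Some text 3</c>
</x>"""

# Default parser: the newline between <x> and <a> is kept as the text of <x>.
default_root = etree.fromstring(XML_SOURCE)
default_texts = [element.text for element in default_root.iter()]

# Parser configured to drop whitespace-only text between tags.
parser = etree.XMLParser(remove_blank_text=True)
clean_root = etree.fromstring(XML_SOURCE, parser)
clean_texts = [element.text for element in clean_root.iter()]

# Skip elements that have no text at all (their .text is None).
for text in clean_texts:
    if text is not None:
        print(text)
```

Here default_texts starts with a whitespace-only string for <x>, while clean_texts starts with None, which the final loop skips. If changing the parser is not an option, filtering on `text and text.strip()` in the loop should give the same printed result.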