I am trying to parse an XML file using Python. Due to the size of the XML, I want to use a pull parser. I found xml.dom.pulldom.
My code starts with
from xml.dom import pulldom

doc = pulldom.parse("myfile.xml")
for event, node in doc:
    # code here...
I am using
if (node.localName == "b"):
to get the XML tag name, and it works fine.
What I can’t figure out is how to get the text between the tags. Using node.nodeValue returns None.
I can use node.toxml() to get the full XML for the node, but I only want the text between the tags. Is there a way to do this other than using a regex replace to take the tags out of node.toxml()?
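You get two events with local name “b” for every tag with text: a START_ELEMENT and an END_ELEMENT. Between them, the parser normally delivers the text as one or more CHARACTERS events. So you are looking for the characters after a matching start-element: remember when you have seen the START_ELEMENT for “b”, then read node.data from the CHARACTERS events that follow, until the matching END_ELEMENT arrives.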
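A minimal sketch of that idea, not necessarily the exact code from the original answer. The sample XML string and the helper name texts_of are made up for illustration (the real myfile.xml is not shown), and pulldom.parseString is used instead of pulldom.parse only so the example runs without a file on disk.

```python
from xml.dom import pulldom

# Hypothetical sample document; the original myfile.xml is not shown.
SAMPLE = "<a><b>first</b><c>skip me</c><b>second</b></a>"


def texts_of(xml_string, tag):
    """Collect the text found between each <tag> ... </tag> pair."""
    results = []
    parts = None  # None means "not currently inside a matching element"
    for event, node in pulldom.parseString(xml_string):
        if event == pulldom.START_ELEMENT and node.localName == tag:
            parts = []  # start collecting text for this element
        elif event == pulldom.CHARACTERS and parts is not None:
            parts.append(node.data)  # text may arrive in several chunks
        elif (event == pulldom.END_ELEMENT and node.localName == tag
              and parts is not None):
            results.append("".join(parts))
            parts = None  # ignore CHARACTERS events until the next match
    return results


print(texts_of(SAMPLE, "b"))  # ['first', 'second']
```

Joining the chunks matters because the parser is free to split one run of text into several CHARACTERS events. For a real file, swapping pulldom.parseString(xml_string) for pulldom.parse("myfile.xml") should be the only change needed.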
With this myfile.xml I get the output
Note that you might need to strip() each string, and you must ignore all other CHARACTERS events. Every line break and every run of whitespace between two elements generates a CHARACTERS event.
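To see the whitespace issue in isolation, here is a small sketch under the same assumptions as before: the pretty-printed sample string and the helper name character_chunks are hypothetical. It collects the data of every CHARACTERS event, then drops the chunks that are empty after strip().

```python
from xml.dom import pulldom

# Hypothetical pretty-printed sample; the line breaks and indentation
# between elements show up as whitespace-only CHARACTERS events.
SAMPLE = """<a>
  <b> padded text </b>
  <b>second</b>
</a>"""


def character_chunks(xml_string):
    """Return the raw data of every CHARACTERS event, in document order."""
    return [node.data
            for event, node in pulldom.parseString(xml_string)
            if event == pulldom.CHARACTERS]


raw = character_chunks(SAMPLE)
# Keep only the chunks that still contain something after stripping.
cleaned = [chunk.strip() for chunk in raw if chunk.strip()]

print(len(raw), "raw chunks, of which", len(cleaned), "carry text")
print(cleaned)
```

The raw list is longer than the cleaned one because of the whitespace between elements. Filtering on strip() alone would also pick up text from other tags, so in practice it is best combined with the check for a matching start-element.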