XML File Sample
<GateDocument>
<GateDocumentFeatures>
...
</GateDocumentFeatures>
<TextWithNodes>
<Node id="0"/>
MESSAGE SET
<Node id="19"/>
<Node id="20"/>
1. 1/1/09 - sample text 1
<Node id="212"/>
sample text 2
<Node id="223"/>
sample text 3
...
<Node id="160652"/>
</TextWithNodes>
<AnnotationSet></AnnotationSet>
<AnnotationSet Name="SomeName">
...
</AnnotationSet>
</GateDocument>
Just to start off, this is the first time I’m coding in Python and dealing with XML, so sorry if I miss really obvious things!
My goal is to extract the sample text at specific node ids.
First attempt – I used minidom, which did not give me the right methods for the extraction (http://stackoverflow.com/questions/11122736/extracting-text-from-xml-node-with-minidom) because of the unusual format here: the node ids sit in self-closing tags.
Second attempt – following suggestions to look at lxml, I successfully extracted the text as something like this:
['\n\t\t', '\n\t\tMESSAGE SET\n\t\t', '\n\t\t', '\n\t\t1. 1/1/09 - sample text 1', ...., '\n\t']
With some cleanup I think I can get the text; however, I lose the associated node id value.
This is the code:
from lxml import etree
from StringIO import StringIO
xmlfile = ('C:\...AnnotationsXML.xml')
xmldoc = etree.parse(xmlfile)
print xmldoc.xpath("//TextWithNodes/text()")
So I guess my questions are:
- Is there a way to extract the above without the \n\t\t? I read that it is whitespace formatting (i.e. tabs), but I am not sure where the <Node id="0"/> went.
- Is there perhaps a better or more efficient method of extraction for this file?
Thanks!
strip is used to get rid of stuff like \t\n, and tail takes the text after the tag.
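As a rough illustration of the strip and tail idea, here is a minimal sketch that keeps each node id together with the text that follows it. The `text()` XPath selects only text nodes, which is why the `Node` elements and their ids disappear from that result; iterating over the elements instead keeps both. The `SAMPLE` string is a shortened stand-in for the real file, and the helper name `text_by_node_id` is made up for this example. The stdlib fallback import is only there so the sketch runs without lxml installed.

```python
try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Stdlib fallback; it exposes the same iter/get/tail calls used below.
    import xml.etree.ElementTree as etree

# Shortened stand-in for the real document (an assumption for this sketch).
SAMPLE = """<GateDocument>
<TextWithNodes>
<Node id="0"/>
MESSAGE SET
<Node id="19"/>
<Node id="20"/>
1. 1/1/09 - sample text 1
<Node id="212"/>
sample text 2
<Node id="223"/>
</TextWithNodes>
</GateDocument>"""


def text_by_node_id(root):
    """Map each Node id to the stripped text that follows that Node."""
    result = {}
    for node in root.iter("Node"):
        # tail is the text between this element's end and the next tag;
        # it is None when nothing follows, hence the "or ''".
        text = (node.tail or "").strip()
        if text:  # skip nodes followed only by whitespace
            result[node.get("id")] = text
    return result


root = etree.fromstring(SAMPLE)
texts = text_by_node_id(root)
print(texts["0"])
```

For a file on disk, `etree.parse(xmlfile).getroot()` would take the place of `etree.fromstring(SAMPLE)`. Note that the ids come back as strings, so look up `texts["212"]` rather than `texts[212]`.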