XML File Sample <GateDocument> <GateDocumentFeatures> … </GateDocumentFeatures> <TextWithNodes> <Node id=0/> MESSAGE SET <Node id=19/>

Asked: June 6, 2026 (2026-06-06T04:14:52+00:00)

XML File Sample

<GateDocument> 
  <GateDocumentFeatures>
    ...
  </GateDocumentFeatures>
  <TextWithNodes>
    <Node id="0"/>
    MESSAGE SET
    <Node id="19"/> 
    <Node id="20"/>
    1. 1/1/09 - sample text 1
    <Node id="212"/>
    sample text 2
    <Node id="223"/>
    sample text 3
    ...
    <Node id="160652"/>
  </TextWithNodes>
  <AnnotationSet></AnnotationSet>
  <AnnotationSet Name="SomeName">
    ...
  </AnnotationSet>
</GateDocument>

Just to start off, this is my first time coding in Python and dealing with XML, so sorry if I miss really obvious things!

My goal is to extract the sample text at specific node ids.

First attempt: I used minidom, but its methods did not let me extract the text correctly (http://stackoverflow.com/questions/11122736/extracting-text-from-xml-node-with-minidom) because of the unusual format, with the node ids in self-closing tags.

Second attempt: following suggestions to look at lxml, I managed to extract the text as something like this:

['\n\t\t', '\n\t\tMESSAGE SET\n\t\t', '\n\t\t', '\n\t\t1. 1/1/09 - sample text 1', ..., '\n\t']

With some cleanup I think I can get the text, but I lose the associated node id values.

This is the code I used:

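from lxml import etree

# Raw string so the backslash in the Windows path is not treated as an escape
xmlfile = r'C:\...AnnotationsXML.xml'
xmldoc = etree.parse(xmlfile)
# text() selects only the text nodes under TextWithNodes, not the Node elements
print(xmldoc.xpath("//TextWithNodes/text()"))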

So I guess my questions are:

Is there a way to extract the above without the \n\t\t? I read that it comes from the whitespace formatting (i.e. tabs), but I am not sure where the <Node id="0"/> went.
Is there perhaps a better or more efficient way to extract the text from this file?

Thanks!

1 Answer

Editorial Team · Answer 1 · 2026-06-06T04:14:53+00:00

In [1]: from lxml import etree

In [2]: tree = etree.parse('awful.xml')

In [3]: data = {int(node.attrib['id']): node.tail.strip()
   ...:         for node in tree.xpath('//TextWithNodes/Node')
   ...:         if node.tail and node.tail.strip()}  # tail is None when nothing follows the tag

In [4]: data
Out[4]: 
{0: 'MESSAGE SET',
 20: '1. 1/1/09 - sample text 1',
 212: 'sample text 2',
 223: 'sample text 3'}

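strip() removes surrounding whitespace such as \t and \n, and tail holds the text that follows an element's closing tag. Because each <Node/> is self-closing, the sample text lives in the tail of the preceding Node rather than inside it, which is why the ids seem to disappear when you select only text().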
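For anyone who wants to try the tail approach without installing lxml, here is a minimal sketch using the standard library's xml.etree.ElementTree instead, which exposes the same tail attribute. The SAMPLE string is a shortened stand-in for the file in the question, and the function name text_by_node_id is made up for illustration; neither comes from the original post.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the GATE file in the question (an assumption,
# not the real document). The last Node has nothing after it on purpose.
SAMPLE = """<GateDocument>
  <TextWithNodes>
    <Node id="0"/>
    MESSAGE SET
    <Node id="19"/>
    <Node id="20"/>
    1. 1/1/09 - sample text 1
    <Node id="212"/>
    sample text 2
    <Node id="223"/>
    sample text 3
    <Node id="160652"/></TextWithNodes>
</GateDocument>"""


def text_by_node_id(root):
    """Map each Node id to the non-empty text that follows its tag."""
    result = {}
    # Only look at Node elements directly under TextWithNodes.
    for node in root.findall("TextWithNodes/Node"):
        # tail is None when nothing follows the tag, so fall back to "".
        tail = (node.tail or "").strip()
        if tail:
            result[int(node.get("id"))] = tail
    return result


root = ET.fromstring(SAMPLE)
data = text_by_node_id(root)

# Look up the text at specific node ids; ids with no text are simply absent.
for wanted in (20, 212, 19):
    print(wanted, data.get(wanted))
```

Nodes that are followed only by whitespace (id 19 here) or by nothing at all (the last one) are left out of the dictionary, so data.get() returns None for them rather than an empty string.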
