I have an XML file with a format similar to docx, e.g.:
<w:r>
<w:rPr>
<w:sz w:val="36"/>
<w:szCs w:val="36"/>
</w:rPr>
<w:t>BIG_TEXT</w:t>
</w:r>
I need to get the index of BIG_TEXT in the source XML, like this:
from lxml import etree
text = open('/devel/tmp/doc2/word/document.xml', 'r').read()
root = etree.XML(text)
start = 0
for e in root.iter("*"):
    if e.text:
        # search only past the previous match
        offset = text.index(e.text, start)
        length = len(e.text)
        print('Text "%s" at offset %s and len=%s' % (e.text, offset, length))
        start = offset + length
I can start each new search from the current index + len(text), but is there another way? An element's text may be a single character, w for example. The search will then find the first w in the source (such as the one in a tag name), not the index of the element text w.
I was looking for a similar solution (indexing nodes in a big XML file for fast lookup).
lxml only offers sourceline, which is insufficient. Cf. the API: "Original line number as found by the parser or None if unknown."
expat provides the exact offset in the file: CurrentByteIndex.
In the start_element handler, it returns the offset of the tag's start (i.e. '<').
In the char_data handler, it returns the offset of the data's start (i.e. 'B' in your example).
Example:
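A minimal sketch of the expat approach described above, using the standard-library xml.parsers.expat module. The function name index_xml, the sample document, and the urn:example namespace URI are assumptions made for illustration and are not from the original post. It records the byte offset of each start tag and of each non-whitespace text chunk.

```python
import xml.parsers.expat


def index_xml(xml_bytes):
    """Collect byte offsets of start tags and text chunks in xml_bytes."""
    tags = []   # (tag name, offset of its '<')
    texts = []  # (text chunk, offset of its first byte)
    parser = xml.parsers.expat.ParserCreate()

    def start_element(name, attrs):
        # Inside this handler CurrentByteIndex points at the tag's '<'.
        tags.append((name, parser.CurrentByteIndex))

    def char_data(data):
        # expat may deliver one text node in several chunks (for example
        # around newlines or entity references); each chunk is recorded
        # with its own offset. Whitespace-only chunks are skipped.
        if data.strip():
            texts.append((data, parser.CurrentByteIndex))

    parser.StartElementHandler = start_element
    parser.CharacterDataHandler = char_data
    parser.Parse(xml_bytes, True)
    return tags, texts


# Hypothetical sample: the element text is the single character "w",
# which a plain substring search would first find inside a tag name.
sample = (b'<w:r xmlns:w="urn:example"><w:rPr><w:sz w:val="36"/></w:rPr>'
          b'<w:t>w</w:t></w:r>')

tags, texts = index_xml(sample)
for text, offset in texts:
    print('Text "%s" at byte offset %d' % (text, offset))
```

Note that these are byte offsets, so they should be used to slice the raw bytes of the file rather than a decoded string; the two differ as soon as the document contains non-ASCII characters.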