I have an xml-file with a format similar to docx, i.e.: <w:r> <w:rPr> <w:sz

Question

Asked: June 11, 2026 (2026-06-11T14:52:12+00:00)

I have an XML file in a format similar to docx, for example:

<w:r>
  <w:rPr>
    <w:sz w:val="36"/>
    <w:szCs w:val="36"/>
  </w:rPr>
  <w:t>BIG_TEXT</w:t>
</w:r>

I need to get the offset of BIG_TEXT in the source XML. My current approach:

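from lxml import etree

# Assumes the file is UTF-8 encoded.
with open('/devel/tmp/doc2/word/document.xml', encoding='utf-8') as f:
    text = f.read()

# lxml rejects str input that carries an encoding declaration, so parse bytes.
root = etree.XML(text.encode('utf-8'))

start = 0
for e in root.iter("*"):
    if e.text:
        # Search forward from the end of the previous match.
        offset = text.index(e.text, start)
        l = len(e.text)
        print('Text "%s" at offset %s and len=%s' % (e.text, offset, l))
        start = offset + l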

I can start each new search from the current index + len(text), but is there another way? An element's text may be a single character, "w" for example. The search will then find the first "w" in the markup (such as the one in a tag name), not the "w" that is the element's text.
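To make the single-character problem concrete, here is a minimal sketch. The sample string is a hypothetical fragment modelled on the one above, with the element text shortened to "w"; it only illustrates why a plain substring search is unreliable here.

```python
# Hypothetical fragment: same shape as the question's sample, text is just "w".
text = '<w:r><w:rPr><w:sz w:val="36"/></w:rPr><w:t>w</w:t></w:r>'

# A plain substring search returns the first "w" anywhere in the source,
# which is the namespace prefix inside the opening "<w:r>" tag.
naive = text.index('w')

# Where the element text really starts: right after the "<w:t>" start tag.
actual = text.index('<w:t>') + len('<w:t>')

print('naive match at', naive, '-> context:', text[naive - 1:naive + 4])
print('real text at  ', actual, '-> context:', text[actual - 5:actual + 5])
```

Searching forward from the previous match does not help in this case, because every tag between two text nodes also contains a "w".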


1 Answer

Editorial Team · Answer 1 · 2026-06-11T14:52:13+00:00

I was looking for a similar solution (indexing nodes in a big XML file for fast lookup).

As far as I know, lxml only offers sourceline, which is insufficient. From the API documentation: "Original line number as found by the parser or None if unknown."
expat, however, provides the exact byte offset in the file: CurrentByteIndex.
- Read inside the start_element handler, it returns the offset of the tag's start (i.e. the '<').
- Read inside the char_data handler, it returns the offset of the data's start (i.e. the 'B' in your example).

Example:

import xml.parsers.expat

# Handler functions for parser events, and housekeeping.
class handler:
   def __init__(self, current_parser):
      # tag of interest
      self.TARGET_TAG = "w:t"

      # set up parser
      self.parser = current_parser
      self.parser.StartElementHandler  = self.start_element
      self.parser.EndElementHandler    = self.end_element
      self.parser.CharacterDataHandler = self.char_data

      self.target_tag_met = False
      self.index = None

   def start_element(self, name, attrs):
      self.target_tag_met = (name == self.TARGET_TAG)

   def end_element(self, name):
      self.target_tag_met = False

   def char_data(self, data):
      if self.target_tag_met:
         self.index = self.parser.CurrentByteIndex
         # expat may deliver one text node in several chunks;
         # only the first chunk marks the start of the text.
         self.target_tag_met = False

# Open the file in binary mode so the reported offsets are true byte offsets.
with open("so_test.xml", 'rb') as xmlFile:
   p = xml.parsers.expat.ParserCreate()
   h = handler(p)
   p.ParseFile(xmlFile)

# Offset of the text of the last w:t element seen, or None if there was none.
print(h.index)
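The example above keeps only one offset. As a rough sketch of how the same idea could report every text node, the variant below collects an (offset, text) pair per w:t element. The function name text_offsets and the in-memory SAMPLE document are hypothetical, introduced only for illustration. It assumes w:t elements have no child elements, and that the text contains no entity references or multi-byte characters, since otherwise the decoded text length differs from the byte length in the source.

```python
import xml.parsers.expat

# Hypothetical sample document, held in memory as bytes.
SAMPLE = (
    b'<w:doc>'
    b'<w:r><w:rPr><w:sz w:val="36"/></w:rPr><w:t>w</w:t></w:r>'
    b'<w:r><w:t>BIG_TEXT</w:t></w:r>'
    b'</w:doc>'
)

def text_offsets(xml_bytes, target_tag="w:t"):
    """Return (byte_offset, text) pairs for the text of each target element."""
    parser = xml.parsers.expat.ParserCreate()
    results = []
    state = {"inside": False, "start": None, "chunks": []}

    def start_element(name, attrs):
        if name == target_tag:
            state.update(inside=True, start=None, chunks=[])

    def char_data(data):
        if state["inside"]:
            # expat may split one text node into several chunks;
            # the first chunk marks where the text starts.
            if state["start"] is None:
                state["start"] = parser.CurrentByteIndex
            state["chunks"].append(data)

    def end_element(name):
        if name == target_tag and state["inside"]:
            if state["start"] is not None:
                results.append((state["start"], "".join(state["chunks"])))
            state["inside"] = False

    parser.StartElementHandler = start_element
    parser.CharacterDataHandler = char_data
    parser.EndElementHandler = end_element
    parser.Parse(xml_bytes, True)
    return results

offsets = text_offsets(SAMPLE)
for offset, text in offsets:
    print('Text "%s" at byte offset %d and len=%d' % (text, offset, len(text)))
```

Because the offset comes from the parser rather than from a substring search, the one-character text "w" is located correctly even though "w" also appears in every tag name. Note that the parser's buffer_text option is left off here: with buffering enabled the character data callback is deferred, so CurrentByteIndex would no longer point at the start of the text.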
