I am working with potentially huge XML files containing complex trace information from on

Question

Asked: May 15, 2026 (2026-05-15T15:55:29+00:00)

I am working with potentially huge XML files containing complex trace information from one of my projects.

I would like to build indexes for those XML files so that one can quickly find subsections of the XML document without having to load it all into memory.

If I have created a “shelve” index that contains information like “books for author Joe are at offsets [22322, 35446, 54545]”, then I can just open the XML file like a regular text file, seek to those offsets, and hand the data to one of the DOM parsers that take a file or a string.
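To make the lookup side concrete, here is a minimal sketch of what I have in mind. It assumes the index stores an (offset, length) pair per fragment rather than a bare offset; the length is an assumption added here so the reader knows where each fragment ends. The function name `load_fragments` and its parameters are hypothetical.

```python
import shelve
from xml.dom import minidom


def load_fragments(xml_path, index_path, key):
    """Yield one DOM document per (offset, length) span stored under key.

    Assumes the shelve at index_path maps keys to lists of
    (offset, length) pairs measured in bytes from the start of the file.
    """
    with shelve.open(index_path, flag="r") as index, open(xml_path, "rb") as f:
        for offset, length in index[key]:
            f.seek(offset)                 # jump straight to the fragment
            fragment = f.read(length)      # read only that fragment
            yield minidom.parseString(fragment)
```

Only the requested fragments are read and parsed, so memory use depends on the fragment size rather than on the size of the whole file.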

The part that I have not figured out yet is how to quickly parse the XML and create such an index.

So what I need is a fast SAX parser that lets me find the start offset of each tag in the file together with its start event. Then I can parse a subsection of the XML knowing its starting point in the document, extract the key information, and store the key and offset in the shelve index.

1 Answer

Editorial Team · Answer 1 · 2026-05-15T15:55:30+00:00

Since locators return line and column numbers rather than offsets, you need a little wrapping to track line ends. Here is a simplified example (it may still contain off-by-one errors ;-)):

import io
import re
from xml import sax
from xml.sax import handler

relinend = re.compile(rb'\n')

txt = b'''<foo>
            <tit>Bar</tit>
        <baz>whatever</baz>
     </foo>'''
stm = io.BytesIO(txt)

class LocatingWrapper:
    """File-like wrapper that records the offset at which each line starts."""
    def __init__(self, f):
        self.f = f
        self.linelocs = [0]  # line 1 starts at offset 0
        self.curoffs = 0     # amount of data handed to the parser so far

    def read(self, *a):
        data = self.f.read(*a)
        # A new line starts right after each newline character.
        linestarts = (m.end() for m in relinend.finditer(data))
        self.linelocs.extend(x + self.curoffs for x in linestarts)
        self.curoffs += len(data)
        return data

    def close(self):
        # Python 3's SAX driver closes the stream once parsing ends.
        self.f.close()

    def where(self, loc):
        # Line numbers are 1-based, column numbers are 0-based.
        return self.linelocs[loc.getLineNumber() - 1] + loc.getColumnNumber()

locstm = LocatingWrapper(stm)

class Handler(handler.ContentHandler):
    def setDocumentLocator(self, loc):
        self.loc = loc
    def startElement(self, name, attrs):
        print('%s@%s:%s (%s)' % (name,
                                 self.loc.getLineNumber(),
                                 self.loc.getColumnNumber(),
                                 locstm.where(self.loc)))

sax.parse(locstm, Handler())

Of course, you don’t need to keep all of the linelocs around. To save memory, you can drop “old” ones (those below the latest one queried), but then you need to make linelocs a dict, etc.
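A rough sketch of that memory-saving variant, written for Python 3, might look like the following. The names `PruningLocatingWrapper`, `OffsetHandler` and `element_offsets` are made up for illustration, and the offsets have only been reasoned through for ASCII input; how columns are counted for multi-byte characters is not verified here.

```python
import io
import re
from xml import sax
from xml.sax import handler

NEWLINE = re.compile(rb"\n")


class PruningLocatingWrapper:
    """Hypothetical wrapper that forgets line starts it no longer needs."""

    def __init__(self, f):
        self.f = f
        self.linelocs = {1: 0}  # line number -> offset where that line starts
        self.lastline = 1       # highest line number recorded so far
        self.oldest = 1         # lowest line number still kept
        self.curoffs = 0        # amount of data handed to the parser so far

    def read(self, *a):
        data = self.f.read(*a)
        for m in NEWLINE.finditer(data):
            self.lastline += 1
            self.linelocs[self.lastline] = self.curoffs + m.end()
        self.curoffs += len(data)
        return data

    def close(self):
        # Python 3's SAX driver closes the stream once parsing ends.
        self.f.close()

    def where(self, loc):
        line = loc.getLineNumber()
        # Events arrive in document order, so earlier lines are not needed again.
        for old in range(self.oldest, line):
            del self.linelocs[old]
        self.oldest = max(self.oldest, line)
        return self.linelocs[line] + loc.getColumnNumber()


class OffsetHandler(handler.ContentHandler):
    """Collects (tag name, start offset) pairs for every start tag."""

    def __init__(self, wrapper):
        super().__init__()
        self.wrapper = wrapper
        self.offsets = []

    def setDocumentLocator(self, loc):
        self.loc = loc

    def startElement(self, name, attrs):
        self.offsets.append((name, self.wrapper.where(self.loc)))


def element_offsets(data):
    """Parse bytes and return the collected offsets plus the wrapper used."""
    wrapper = PruningLocatingWrapper(io.BytesIO(data))
    h = OffsetHandler(wrapper)
    sax.parse(wrapper, h)
    return h.offsets, wrapper
```

Calling `element_offsets(xml_bytes)` returns the (name, offset) pairs that could then be stored in the shelve index. Note that the parser reads ahead in chunks, so line starts for the current chunk are still held in the dict until the parser's events catch up with them.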
