I am reading a large file containing various <xml>..</xml> elements. Since every XML parser

Question

Asked: May 23, 2026 (2026-05-23T05:07:19+00:00)

I am reading a large file containing various <xml>..</xml> elements. Since every XML parser has trouble with that (a well-formed document may have only one root element), I would like to efficiently produce a new file object for each <xml>..</xml> block.

I started subclassing the file object in Python, but got stuck there. I think I have to intercept each line starting with </xml> and return a new file object, maybe by using yield.

Can someone guide me to do the step in the right direction?

Here is my current code fragment:

#!/usr/bin/env python

from lxml import etree
from StringIO import StringIO

class handler(file):
  def __init__(self, name, mode):
    file.__init__(self, name, mode)

  def next(self):
    return file.next(self)

  def listXmls(self):
    output = StringIO()
    line = self.next()
    while line is not None:
      output.write(line.strip())
      if line.strip() == '</xml>':
        yield output
        output = StringIO()
      try:
        line = self.next()
      except StopIteration:
        break
    output.close()

f = handler('myxml.xml', 'r')
for elem in f.listXmls():
  print 'm' + elem.getvalue() + 'm'
  context = etree.iterparse(elem, events=('end',), tag='id')
  for event, element in context:
    print element.tag

Thanks!

SOLUTION (still interested in a better version):

#!/usr/bin/env python

from lxml import etree
from StringIO import StringIO

class handler(file):
  def __init__(self, name, mode):
    file.__init__(self, name, mode)

  def next(self):
    return file.next(self)

  def listXmls(self):
    output = StringIO()
    output.write(self.next())
    line = self.next()
    while line is not None:
      if line.startswith('<?xml'):
        output.seek(0)
        yield output
        output = StringIO()
      output.write(line)
      try:
        line = self.next()
      except StopIteration:
        break
    output.seek(0)
    yield output

f = handler('myxml.xml', 'r')
for elem in f.listXmls():
  context = etree.iterparse(elem, events=('end',), tag='id')
  for event, element in context:
    print element.tag
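
For comparison, here is a minimal sketch of the same splitting idea as a plain generator, without subclassing file. It assumes Python 3, and, as in the solution above, that every document begins with a line starting with <?xml. The name split_xml_documents and the sample data are hypothetical, and the standard library's xml.etree.ElementTree stands in for lxml so that the sketch runs without extra packages.

```python
import io
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree


def split_xml_documents(lines, marker=b"<?xml"):
    """Yield one BytesIO per document; a new document starts at each marker line."""
    buffer = io.BytesIO()
    for line in lines:
        # A marker line closes the previous document, if anything was collected.
        if line.startswith(marker) and buffer.tell() > 0:
            buffer.seek(0)
            yield buffer
            buffer = io.BytesIO()
        buffer.write(line)
    # Emit the last document, which has no marker line after it.
    if buffer.tell() > 0:
        buffer.seek(0)
        yield buffer


# Hypothetical sample data standing in for myxml.xml opened in binary mode.
sample = io.BytesIO(
    b'<?xml version="1.0"?>\n<xml><id>1</id></xml>\n'
    b'<?xml version="1.0"?>\n<xml><id>2</id></xml>\n'
)
ids = []
for document in split_xml_documents(sample):
    for event, element in ET.iterparse(document, events=("end",)):
        if element.tag == "id":
            ids.append(element.text)
print(ids)
```

Because the generator accepts any iterable of byte lines, a real file opened with open('myxml.xml', 'rb') could be passed in place of the sample. Only one document is held in memory at a time.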

1 Answer

Editorial Team · Answer 1 · 2026-05-23T05:07:20+00:00

While not a direct answer to your question, this may solve your problem anyway: wrapping the whole content in a single root element, by adding another <xml> at the beginning and another </xml> at the end, will probably make your XML parser accept the document:

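from lxml import etree
document = "<xml>a</xml> <xml>b</xml>"
# Wrap the fragments in a single root element so the string is well-formed.
document = "<xml>" + document + "</xml>"
for subdocument in etree.XML(document):
    # Each child is one of the original <xml> blocks.
    print(subdocument.text)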
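
One caveat: wrapping alone fails if each block carries its own <?xml ...?> declaration, because a declaration is only allowed at the very start of a document. A minimal sketch of one way around that, assuming Python 3 and that the content fits in memory, is to remove the declarations before wrapping. The helper wrap_fragments and the root name "root" are hypothetical, and xml.etree.ElementTree is used here in place of lxml.

```python
import re
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree

# Matches an XML declaration such as <?xml version="1.0" encoding="UTF-8"?>.
DECLARATION = re.compile(r"<\?xml\s[^>]*\?>")


def wrap_fragments(text, root="root"):
    """Remove per-block declarations and wrap all blocks in one root element."""
    body = DECLARATION.sub("", text)
    return "<{0}>{1}</{0}>".format(root, body)


# Hypothetical sample: two documents concatenated in one string.
text = '<?xml version="1.0"?><xml>a</xml>\n<?xml version="1.0"?><xml>b</xml>'
texts = [child.text for child in ET.fromstring(wrap_fragments(text))]
print(texts)
```

This reads everything into one string, so for a very large file a line-by-line split may be the better fit.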
