I have been trying to parse a file with xml.etree.ElementTree : import xml.etree.ElementTree as

Question

Asked: May 26, 2026 (2026-05-26T05:21:57+00:00)

I have been trying to parse a file with xml.etree.ElementTree:

import xml.etree.ElementTree as ET
from xml.etree.ElementTree import ParseError

def analyze(xml):
    it = ET.iterparse(xml)  # iterparse accepts a filename; file() exists only in Python 2
    count = 0
    last = None

    try:
        for (ev, el) in it:
            count += 1
            last = el  # remember the last element parsed successfully

    except ParseError:
        print("catastrophic failure")
        print("last successful: {0}".format(last))

    print('count: {0}'.format(count))

This is of course a simplified version of my code, but it is enough to break my program. If I remove the try/except block, I get this error with some files:

Traceback (most recent call last):
  File "<pyshell#22>", line 1, in <module>
    from yparse import analyze; analyze('file.xml')
  File "C:\Python27\yparse.py", line 10, in analyze
    for (ev, el) in it:
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1258, in next
    self._parser.feed(data)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1624, in feed
    self._raiseerror(v)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1488, in _raiseerror
    raise err
ParseError: reference to invalid character number: line 1, column 52459

The results are deterministic, though: if a file works, it always works. If a file fails, it always fails, and always at the same point.

Strangest of all, when I use the traceback to locate the malformed XML and isolate the node where parsing failed, an XML file containing just that node and a few of its neighbors parses without error!

This doesn’t seem to be a size problem either. I have managed to parse much larger files with no problems.

Any ideas?

1 Answer

Editorial Team · Answer 1 · 2026-05-26T05:21:57+00:00

As @John Machin suggested, the files in question do have dubious numeric entities in them, though the error messages seem to be pointing at the wrong place in the text. Perhaps the streaming nature and buffering are making it difficult to report accurate positions.

In fact, all of these entities appear in the text:

set(['&#x08;', '&#x0E;', '&#x1E;', '&#x1C;', '&#x18;', '&#x04;', '&#x0A;', '&#x0C;', '&#x16;', '&#x14;', '&#x06;', '&#x00;', '&#x10;', '&#x02;', '&#x0D;', '&#x1D;', '&#x0F;', '&#x09;', '&#x1B;', '&#x05;', '&#x15;', '&#x01;', '&#x03;'])
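Because the reported column may not point at the real culprit, it can help to scan the raw text directly. The following is a minimal sketch, not the method used to produce the set above: `find_invalid_refs` and `is_valid_xml_char` are hypothetical helper names, and the validity check follows the `Char` production of XML 1.0.

```python
import re

# Numeric character references: decimal (&#8;) or hexadecimal (&#x08;).
CHAR_REF = re.compile(r'&#(?:x([0-9A-Fa-f]+)|([0-9]+));')

def is_valid_xml_char(cp):
    """Return True if code point cp is allowed by the XML 1.0 Char production."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

def find_invalid_refs(text):
    """Return (offset, reference) pairs for every disallowed reference in text."""
    hits = []
    for m in CHAR_REF.finditer(text):
        # Group 1 holds hex digits, group 2 holds decimal digits.
        cp = int(m.group(1), 16) if m.group(1) else int(m.group(2))
        if not is_valid_xml_char(cp):
            hits.append((m.start(), m.group(0)))
    return hits
```

Calling `find_invalid_refs` on the contents of a failing file gives the character offset of each offending reference, which can then be compared with the column in the `ParseError` message.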

Most are not allowed: among the characters below #x20, XML 1.0 permits only &#x09;, &#x0A; and &#x0D;. This parser is quite strict about that, so you’ll need to either find a more lenient one or pre-process the XML.
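The pre-processing route might look roughly like the sketch below. It assumes the whole document fits in memory as text and that silently dropping the offending references is acceptable for your data; `strip_invalid_refs` and `parse_lenient` are hypothetical names, not part of `xml.etree.ElementTree`.

```python
import re
import xml.etree.ElementTree as ET

# Numeric character references: decimal (&#8;) or hexadecimal (&#x08;).
CHAR_REF = re.compile(r'&#(?:x([0-9A-Fa-f]+)|([0-9]+));')

def _allowed(cp):
    """True if code point cp is a legal XML 1.0 character."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

def strip_invalid_refs(text, replacement=''):
    """Replace references to illegal characters; keep legal ones untouched."""
    def fix(m):
        cp = int(m.group(1), 16) if m.group(1) else int(m.group(2))
        return m.group(0) if _allowed(cp) else replacement
    return CHAR_REF.sub(fix, text)

def parse_lenient(xml_text):
    """Clean the text first, then hand it to the strict parser."""
    return ET.fromstring(strip_invalid_refs(xml_text))
```

Note that this gives up the streaming behaviour of `iterparse`, and that the regular expression also rewrites matching text inside CDATA sections, where it is literal text rather than a reference. For very large files, the same substitution could be applied to chunks before feeding them to the parser, with care taken that a reference is not split across a chunk boundary.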
