I have been working on code that parses external XML-files. Some of these files

Asked: June 7, 2026

I have been working on code that parses external XML-files. Some of these files are huge, up to gigabytes of data. Needless to say, these files need to be parsed as a stream because loading them into memory is much too inefficient and often leads to OutOfMemory troubles.

I have used the libraries miniDOM, ElementTree, cElementTree and I am currently using lxml.
Right now I have a working, pretty memory-efficient script, using lxml.etree.iterparse. The problem is that some of the XML files I need to parse contain encoding errors (they advertise as UTF-8, but contain differently encoded characters). When using lxml.etree.parse this can be fixed by using the recover=True option of a custom parser, but iterparse does not accept a custom parser. (see also: this question)

My current code looks like this:

from lxml import etree

events = ("start", "end")
context = etree.iterparse(xmlfile, events=events)
event, root_element = next(context)  # first event: start of the root element, <items>
for action, element in context:
    if action == 'end' and element.tag == 'item':
        # <parse>
        root_element.clear()  # drop already-processed children to keep memory use flat

Error when iterparse encounters a bad character (in this case, it’s a ^Y):

lxml.etree.XMLSyntaxError: Input is not proper UTF-8, indicate encoding !
Bytes: 0x19 0x73 0x20 0x65, line 949490, column 25

I don’t even wish to decode this data; I can just drop it. However, I don’t know of any way to skip the element: I tried both context.next() and continue inside try/except statements.

Any help would be appreciated!

Update

Some additional info:
This is the line where iterparse fails:

<description><![CDATA:[musea de la photographie fonds mercator. Met meer dan 80.000 foto^Ys en 3 miljoen negatieven is het Muse de la...]]></description>

According to etree, the error occurs at bytes 0x19 0x73 0x20 0x65.
According to hexedit, 19 73 20 65 translates to ASCII .s e
The . in this place should be an apostrophe (foto’s).
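
As a small check of the byte analysis above, the following sketch decodes the four reported bytes and tests each character against the XML 1.0 "Char" production. It suggests that 0x19 is acceptable UTF-8 on its own but is a control character that XML 1.0 does not allow, which may be why the parser rejects it. The names XML_CHAR, raw, text and illegal are illustrative and not part of the original script.

```python
import re

# XML 1.0 "Char" production: the only code points allowed in a document.
XML_CHAR = re.compile(
    '[\t\n\r\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)

raw = bytes.fromhex('19732065')  # the bytes from the error message
text = raw.decode('utf-8')       # decodes without raising an error

# Collect the code points that XML 1.0 does not permit.
illegal = [hex(ord(ch)) for ch in text if not XML_CHAR.fullmatch(ch)]
print(repr(text), illegal)
```

Only the first of the four bytes is flagged; the remaining three are the ordinary ASCII characters "s", space and "e".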

I also found this question, which does not provide a solution.


1 Answer

Editorial Team · Answer 1 · 2026-06-07T12:09:39+00:00

Since the problem is being caused by illegal XML characters, in this case the 0x19 byte, I decided to strip them off. I found the following regular expression on this site:

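import re
import urllib.request

# ASCII control characters other than tab, LF and CR, plus DEL (0x7F)
invalid_xml = re.compile(rb'[\x00-\x08\x0B-\x0C\x0E-\x1F\x7F]')

And I wrote this piece of code, which removes the illegal bytes while saving an XML feed to disk (shown here in Python 3 syntax; the feed is read and written as bytes):

conn = urllib.request.urlopen(xmlfeed)
with open('output', 'wb') as xmlfile:
    while True:
        data = conn.read(4096)  # read the feed in 4 KB chunks
        if not data:
            break  # end of feed
        newdata, count = invalid_xml.subn(b'', data)
        if count > 0:
            print('Removed %s illegal characters from XML feed' % count)
        xmlfile.write(newdata)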
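
If writing a cleaned copy to disk is not desirable, the same filtering could probably be done on the fly. The sketch below wraps a binary stream in a hypothetical CleaningReader class whose read() strips the same byte range, so the parser only ever sees cleaned data. It uses the standard library's xml.etree.ElementTree.iterparse to stay self-contained; the assumption is that lxml.etree.iterparse, which also accepts file-like objects, could be used the same way. CleaningReader, iter_items and the sample feed are illustrative names and data, not part of the original answer.

```python
import io
import re
import xml.etree.ElementTree as ET

# Same character class as the regular expression above, as a bytes pattern.
INVALID_XML = re.compile(rb'[\x00-\x08\x0B-\x0C\x0E-\x1F\x7F]')


class CleaningReader:
    """Hypothetical file-like wrapper that drops illegal bytes on read()."""

    def __init__(self, raw):
        self.raw = raw      # any binary file-like object with read()
        self.removed = 0    # running count of dropped bytes

    def read(self, size=-1):
        while True:
            data = self.raw.read(size)
            if not data:
                return b''  # genuine end of input
            cleaned, count = INVALID_XML.subn(b'', data)
            self.removed += count
            if cleaned:
                return cleaned
            # The whole chunk was illegal bytes: read again, because
            # returning b'' here would look like end of file.


def iter_items(source, tag='item'):
    """Yield the text of each <tag> element, clearing it afterwards."""
    for _event, element in ET.iterparse(source, events=('end',)):
        if element.tag == tag:
            yield element.text
            element.clear()


# Small in-memory feed containing the offending 0x19 byte.
feed = io.BytesIO(b'<items><item>foto\x19s</item><item>ok</item></items>')
reader = CleaningReader(feed)
texts = list(iter_items(reader))
print(texts, reader.removed)
```

Stripping single bytes like this is safe for UTF-8 and other ASCII-compatible single-byte encodings, because these byte values never occur inside a multi-byte UTF-8 sequence. It would corrupt UTF-16 or UTF-32 input.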
