Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8216537
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 7, 20262026-06-07T12:09:37+00:00 2026-06-07T12:09:37+00:00

I have been working on code that parses external XML-files. Some of these files

  • 0

I have been working on code that parses external XML-files. Some of these files are huge, up to gigabytes of data. Needless to say, these files need to be parsed as a stream because loading them into memory is much too inefficient and often leads to OutOfMemory troubles.

I have used the libraries miniDOM, ElementTree, cElementTree and I am currently using lxml.
Right now I have a working, pretty memory-efficient script, using lxml.etree.iterparse. The problem is that some of the XML files I need to parse contain encoding errors (they advertise as UTF-8, but contain differently encoded characters). When using lxml.etree.parse this can be fixed by using the recover=True option of a custom parser, but iterparse does not accept a custom parser. (see also: this question)

My current code looks like this:

from lxml import etree
events = ("start", "end")
context = etree.iterparse(xmlfile, events=events)
event, root_element = context.next() # <items>
for action, element in context:
    if action == 'end' and element.tag == 'item':
    # <parse>
    root_element.clear() 

Error when iterparse encounters a bad character (in this case, it’s a ^Y):

lxml.etree.XMLSyntaxError: Input is not proper UTF-8, indicate encoding !
Bytes: 0x19 0x73 0x20 0x65, line 949490, column 25

I don’t even wish to decode this data, I can just drop it. However I don’t know any way to skip the element – I tried context.next and continue in try/except statements.

Any help would be appreciated!

Update

Some additional info:
This is the line where iterparse fails:

<description><![CDATA:[musea de la photographie fonds mercator. Met meer dan 80.000 foto^Ys en 3 miljoen negatieven is het Muse de la...]]></description>

According to etree, the error occurs at bytes 0x19 0x73 0x20 0x65.
According to hexedit, 19 73 20 65 translates to ASCII .s e
The . in this place should be an apostrophe (foto’s).

I also found this question, which does not provide a solution.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-07T12:09:39+00:00Added an answer on June 7, 2026 at 12:09 pm

    Since the problem is being caused by illegal XML characters, in this case the 0x19 byte, I decided to strip them off. I found the following regular expression on this site:

    invalid_xml = re.compile(u'[\x00-\x08\x0B-\x0C\x0E-\x1F\x7F]')
    

    And I wrote this piece of code that removes illegal bytes while saving an xml feed:

    conn = urllib2.urlopen(xmlfeed)
    xmlfile = open('output', 'w')
    
    while True:
        data = conn.read(4096)
        if data:
            newdata, count = invalid_xml.subn('', data)
            if count > 0 :
                print 'Removed %s illegal characters from XML feed' % count
            xmlfile.write(newdata)
    
        else:
            break
    
    xmlfile.close()
    
    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

Hi I have been working on this code that creates a table with radio
I am new to github. I have been working on a code that I
I have been working on some code to submit an uploaded image to PHP
I have a code that has been working for almost 4 years (since boost
I have some code that downloads lessons as JSON, parses them, and puts them
I have been working on information extraction and was able to run standAloneAnnie.java http://gate.ac.uk/wiki/code-repository/src/sheffield/examples/StandAloneAnnie.java
I have code which as been working against an older Active Directory server and
Hey there. I have a c# application that parses txt files and imports the
I have been working on this problem for quite some time and can't figure
I've been working with xml lately. And have noticed a bit of a phenomenon

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.