The Archive Base


Editorial Team
Asked: June 18, 2026

I am reading an 800 GB xml file in python 2.7 and parsing it


I am reading an 800 GB xml file in python 2.7 and parsing it with an etree iterative parser.

Currently, I am just using open('foo.txt') with no buffering argument. I am not sure whether this is the right approach, or whether I should pass a buffering argument or use something from io, such as io.BufferedReader, io.open, or io.TextIOBase.

A point in the right direction would be much appreciated.


1 Answer

  1. Editorial Team
    Added an answer on June 18, 2026 at 10:01 pm

    The standard open() function already returns a buffered file by default (if buffering is available on your platform). For regular file objects that usually means fully buffered.

    "Usually" here means that Python leaves the choice to the C stdlib implementation: it opens the file with an fopen() call (_wfopen() on Windows, to support UTF-16 filenames), so the C library picks the default buffer size for the file; on Linux I believe that would be 8 KB. For a pure-read operation like XML parsing this type of buffering is exactly what you want.

    The XML parsing done by iterparse reads the file in chunks of 16384 bytes (16 KB).
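If you want to check the read size on your own interpreter, one option is to wrap the file object and record what iterparse asks for. This is only a sketch: ReadLogger is a hypothetical helper, not part of the standard library, and the in-memory document stands in for the real file.

```python
import io
import xml.etree.ElementTree as ET


class ReadLogger(object):
    """Hypothetical helper: wraps a binary file object and records read sizes."""

    def __init__(self, fileobj):
        self._fileobj = fileobj
        self.sizes = []

    def read(self, size=-1):
        # Remember the size the caller asked for, then delegate the read.
        self.sizes.append(size)
        return self._fileobj.read(size)


# Small in-memory document standing in for the real file.
xml_bytes = b"<root>" + b"<item>x</item>" * 5000 + b"</root>"
logged = ReadLogger(io.BytesIO(xml_bytes))

for event, elem in ET.iterparse(logged):
    elem.clear()  # drop element content we no longer need

# The distinct read sizes the parser requested.
requested_sizes = sorted(set(logged.sizes))
print(requested_sizes)
```

The printed sizes are an implementation detail of the ElementTree version in use, so treat them as an observation rather than a guarantee.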

    If you want to control the buffer size, use the buffering keyword argument:

    open('foo.xml', buffering=(2<<16) + 8)  # 2<<16 == 131072 bytes == 8 parser reads of 16384 bytes, plus 8
    

    which will override the default buffer size (which I'd expect to match the file block size or a multiple thereof). According to this article, increasing the read buffer should help, and a size of at least 4 times the expected read block size plus 8 bytes should improve read performance. In the above example I've set it to 8 times the ElementTree read size, plus 8 bytes.
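The arithmetic behind that buffer size can be spelled out as a small helper. This is a sketch only: buffer_size_for and PARSER_READ_SIZE are hypothetical names, the temporary file stands in for foo.xml, and the "plus 8 bytes" rule is taken from the article as reported above.

```python
import os
import tempfile

PARSER_READ_SIZE = 16384  # bytes iterparse requests per read, as noted above


def buffer_size_for(reads, read_size=PARSER_READ_SIZE):
    """Return a buffer size holding `reads` parser reads plus 8 extra bytes."""
    return reads * read_size + 8


size = buffer_size_for(8)  # same value as (2 << 16) + 8

# Temporary file standing in for the real XML file.
fd, path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as f:
    f.write(b"<root/>")

# Pass the computed size as the buffering argument.
with open(path, "rb", buffering=size) as xml_file:
    first_bytes = xml_file.read(7)

os.remove(path)
```

Writing the size as a multiple of the parser read size makes it easier to try other multiples, such as the minimum of 4 the article suggests.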

    The io.open() function represents the new Python 3 I/O structure, where I/O has been split up into a hierarchy of class types to give you more flexibility. The price is more indirection: the data travels through more layers, and the Python C code does more work itself instead of leaving that to the OS.

    You could test whether io.open('foo.xml', 'rb', buffering=2<<16) performs any better. Opening in rb mode will give you an io.BufferedReader instance.

    You do not want to use io.TextIOWrapper, which is the type you get if you open in r (text mode) instead. The underlying expat parser wants raw bytes, as it will decode your XML file encoding itself, so a text wrapper would only add extra overhead.
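A quick way to see which type each mode returns, and to confirm that iterparse accepts the binary reader directly, is sketched below. The temporary file and its contents are placeholders for the real document.

```python
import io
import os
import tempfile
import xml.etree.ElementTree as ET

# Temporary file standing in for the real XML file.
fd, path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as f:
    f.write(b"<root><item>x</item></root>")

# Binary mode: a buffered reader that hands raw bytes to the parser.
with io.open(path, "rb", buffering=2 << 16) as binary_file:
    binary_type = type(binary_file).__name__
    tags = [elem.tag for _, elem in ET.iterparse(binary_file)]

# Text mode: a decoding wrapper layered on top of the buffered reader.
with io.open(path, "r", encoding="utf-8") as text_file:
    text_type = type(text_file).__name__

os.remove(path)
print("rb -> %s, r -> %s" % (binary_type, text_type))
```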

    Using io.open() may give you more flexibility and a richer API, but the underlying file is opened with the low-level open() call instead of C's fopen(), and all buffering is handled by Python's own io.BufferedIOBase implementation (io.BufferedReader when reading) rather than by the C library.

    I think your problem will be processing this beast, not the file reads. The disk cache will be pretty much shot anyway when reading an 800 GB file.
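On the processing side, a common pattern with iterparse is to discard elements once they have been handled, so the tree does not grow with the file. The sketch below assumes a hypothetical layout of repeated record elements directly under the root; count_records and the tag names are illustrative, not taken from the question.

```python
import io
import xml.etree.ElementTree as ET


def count_records(source, tag):
    """Count `tag` elements while keeping the in-memory tree small."""
    count = 0
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            count += 1  # real per-record processing would go here
            root.clear()  # drop finished children so memory stays bounded
    return count


# Small in-memory document standing in for the real file.
data = b"<root>" + b"<record><name>a</name></record>" * 1000 + b"</root>"
total = count_records(io.BytesIO(data), "record")
print(total)
```

Clearing the root after each record is what keeps memory flat; without it, iterparse still builds the complete tree as it goes.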
