Editorial Team
Asked: May 24, 2026

I need to write a parser in Python that can process some extremely large


I need to write a parser in Python that can process some extremely large files (> 2 GB) on a computer without much memory (only 2 GB). I want to use iterparse in lxml to do it.

My file is of the format:

<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
</item>

and so far my solution is:

from lxml import etree

context = etree.iterparse(MYFILE, tag='item')

for event, elem in context:
    # The sample file above uses <desc>, not <description>
    print(elem.xpath('desc/text()'))

del context

Unfortunately, this solution still uses a lot of memory. I think the problem is that after dealing with each “item” I need to do something to clean up empty children. Can anyone suggest what I should do after processing each element to clean up properly?


1 Answer

  1. Editorial Team
    Added an answer on May 24, 2026 at 11:49 pm

    Try Liza Daly’s fast_iter. After processing an element, elem, it calls elem.clear() to remove descendants and also removes preceding siblings.

    from lxml import etree


    def fast_iter(context, func, *args, **kwargs):
        """
        http://lxml.de/parsing.html#modifying-the-tree
        Based on Liza Daly's fast_iter
        http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
        See also http://effbot.org/zone/element-iterparse.htm
        """
        for event, elem in context:
            func(elem, *args, **kwargs)
            # It's safe to call clear() here because no descendants will be
            # accessed
            elem.clear()
            # Also eliminate now-empty references from the root node to elem
            for ancestor in elem.xpath('ancestor-or-self::*'):
                while ancestor.getprevious() is not None:
                    del ancestor.getparent()[0]
        del context


    def process_element(elem):
        # The sample file uses <desc>, so select that child's text
        print(elem.xpath('desc/text()'))


    context = etree.iterparse(MYFILE, tag='item')
    fast_iter(context, process_element)
    

    Daly’s article is an excellent read, especially if you are processing large XML files.
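    The same cleanup can also be wrapped in a generator, so the caller just loops over the extracted values. This is only a sketch: the name iter_desc, the wrapping <items> root element, and the in-memory SAMPLE document are assumptions made for illustration, not part of the original question or of Daly's article.

```python
import io

from lxml import etree

# Hypothetical sample document; a real file would be opened by path instead.
SAMPLE = b"""<items>
  <item><title>Item 1</title><desc>Description 1</desc></item>
  <item><title>Item 2</title><desc>Description 2</desc></item>
</items>"""


def iter_desc(source, tag='item'):
    """Yield the <desc> text of each matching element, pruning as it goes."""
    for event, elem in etree.iterparse(source, events=('end',), tag=tag):
        # Read what is needed before the element is emptied
        yield elem.findtext('desc')
        # Drop the element's own children, text and attributes
        elem.clear()
        # Drop already-processed siblings of elem and of each ancestor
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]


descriptions = list(iter_desc(io.BytesIO(SAMPLE)))
print(descriptions)
```

    Collecting the results in a list, as done here, is only sensible for a small sample. For a multi-gigabyte file, consume the generator one value at a time so nothing accumulates.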


    Edit: The fast_iter posted above is a modified version of Daly’s fast_iter. After processing an element, it is more aggressive at removing other elements that are no longer needed.

    The script below shows the difference in behavior. Note in particular that orig_fast_iter does not delete the A1 element, while the mod_fast_iter does delete it, thus saving more memory.

    import lxml.etree as ET
    import textwrap
    import io
    
    def setup_ABC():
        content = textwrap.dedent('''\
          <root>
            <A1>
              <B1></B1>
              <C>1<D1></D1></C>
              <E1></E1>
            </A1>
            <A2>
              <B2></B2>
              <C>2<D></D></C>
              <E2></E2>
            </A2>
          </root>
            ''')
        # io.BytesIO needs bytes on Python 3, so return the encoded document
        return content.encode('utf-8')
    
    
    def study_fast_iter():
        def orig_fast_iter(context, func, *args, **kwargs):
            for event, elem in context:
                print('Processing {e}'.format(
                    e=ET.tostring(elem, encoding='unicode', with_tail=False)))
                func(elem, *args, **kwargs)
                print('Clearing {e}'.format(
                    e=ET.tostring(elem, encoding='unicode', with_tail=False)))
                elem.clear()
                while elem.getprevious() is not None:
                    print('Deleting {p}'.format(
                        p=(elem.getparent()[0]).tag))
                    del elem.getparent()[0]
            del context
    
        def mod_fast_iter(context, func, *args, **kwargs):
            """
            http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
            Author: Liza Daly
            See also http://effbot.org/zone/element-iterparse.htm
            """
            for event, elem in context:
                print('Processing {e}'.format(
                    e=ET.tostring(elem, encoding='unicode', with_tail=False)))
                func(elem, *args, **kwargs)
                # It's safe to call clear() here because no descendants will be
                # accessed
                print('Clearing {e}'.format(
                    e=ET.tostring(elem, encoding='unicode', with_tail=False)))
                elem.clear()
                # Also eliminate now-empty references from the root node to elem
                for ancestor in elem.xpath('ancestor-or-self::*'):
                    print('Checking ancestor: {a}'.format(a=ancestor.tag))
                    while ancestor.getprevious() is not None:
                        print(
                            'Deleting {p}'.format(p=(ancestor.getparent()[0]).tag))
                        del ancestor.getparent()[0]
            del context
    
        content = setup_ABC()
        context = ET.iterparse(io.BytesIO(content), events=('end', ), tag='C')
        orig_fast_iter(context, lambda elem: None)
        # Processing <C>1<D1/></C>
        # Clearing <C>1<D1/></C>
        # Deleting B1
        # Processing <C>2<D/></C>
        # Clearing <C>2<D/></C>
        # Deleting B2
    
        print('-' * 80)
        """
        The improved fast_iter deletes A1. The original fast_iter does not.
        """
        content = setup_ABC()
        context = ET.iterparse(io.BytesIO(content), events=('end', ), tag='C')
        mod_fast_iter(context, lambda elem: None)
        # Processing <C>1<D1/></C>
        # Clearing <C>1<D1/></C>
        # Checking ancestor: root
        # Checking ancestor: A1
        # Checking ancestor: C
        # Deleting B1
        # Processing <C>2<D/></C>
        # Clearing <C>2<D/></C>
        # Checking ancestor: root
        # Checking ancestor: A2
        # Deleting A1
        # Checking ancestor: C
        # Deleting B2
    
    study_fast_iter()
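    The difference can also be checked without reading the printed trace, by looking at which children are still attached to the root once parsing has finished. The sketch below assumes a compact, whitespace-free copy of the same document; the helper names prune_siblings_only, prune_ancestors_too and surviving_top_level_tags are made up for this illustration.

```python
import io

import lxml.etree as ET

# Compact copy of the document used in the script above
XML = (b"<root><A1><B1/><C>1<D1/></C><E1/></A1>"
       b"<A2><B2/><C>2<D/></C><E2/></A2></root>")


def prune_siblings_only(elem):
    """Original strategy: drop only the preceding siblings of elem."""
    while elem.getprevious() is not None:
        del elem.getparent()[0]


def prune_ancestors_too(elem):
    """Modified strategy: also drop preceding siblings of every ancestor."""
    for ancestor in elem.xpath('ancestor-or-self::*'):
        while ancestor.getprevious() is not None:
            del ancestor.getparent()[0]


def surviving_top_level_tags(prune):
    """Parse XML, prune after each <C>, and list what is left under <root>."""
    root = None
    context = ET.iterparse(io.BytesIO(XML), events=('end',), tag='C')
    for event, elem in context:
        elem.clear()
        prune(elem)
        root = elem.getroottree().getroot()
    return [child.tag for child in root]


print(surviving_top_level_tags(prune_siblings_only))
print(surviving_top_level_tags(prune_ancestors_too))
```

    With the original strategy A1 stays attached to the root, so on a large file every processed top-level branch would remain in memory. With the modified strategy only the branch currently being parsed is left.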
    