The Archive Base

Editorial Team
Asked: May 22, 2026

I want to parse a huge XML file. A record in this file looks, for example, like this. In general, the file looks like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
    record_1
    ...
    record_n
</dblp>

I wrote some code that should extract a selection of records from this file.

If I let the code run (it takes nearly 50 minutes, including storage in the MySQL database), I notice that there is a record which seems to have nearly a million authors. This must be wrong. I even checked the file to make sure that it has no errors in it. The paper has only 5 or 6 authors, so all is fine with dblp.xml. So I assume there is a logic error in my code, but I can’t figure out where it could be. Perhaps someone can tell me where the error is?

The code stops at the line if len(auth) > 2000.

import sys
import MySQLdb
from lxml import etree


elements = ['article', 'inproceedings', 'proceedings', 'book', 'incollection']
tags = ["author", "title", "booktitle", "year", "journal"]


def fast_iter(context, cursor):
    mydict = {} # represents a paper with all its tags.
    auth = [] # a list of authors who have written the paper "together".
    counter = 0 # counts the papers

    for event, elem in context:
        if elem.tag in elements and event == "start":
            mydict["element"] = elem.tag
            mydict["mdate"] = elem.get("mdate")
            mydict["key"] = elem.get("key")

        elif elem.tag == "title" and elem.text != None:
            mydict["title"] = elem.text
        elif elem.tag == "booktitle" and elem.text != None:
            mydict["booktitle"] = elem.text
        elif elem.tag == "year" and elem.text != None:
            mydict["year"] = elem.text
        elif elem.tag == "journal" and elem.text != None:
            mydict["journal"] = elem.text
        elif elem.tag == "author" and elem.text != None:
            auth.append(elem.text)
        elif event == "end" and elem.tag in elements:
            counter += 1
            print counter
            #populate_database(mydict, auth, cursor)
            mydict.clear()
            auth = []
            if mydict or auth:
                sys.exit("Program aborted because auth or mydict was not deleted properly!")
        if len(auth) > 200: # There are up to ~150 authors per paper. 
            sys.exit("auth: It seams there is a paper which has too many authors.!")
        if len(mydict) > 50: # A paper can have much metadata.
            sys.exit("mydict: It seams there is a paper which has too many tags.")

        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context


def main():
        cursor = connectToDatabase()
        cursor.execute("""SET NAMES utf8""")

        context = etree.iterparse(PATH_TO_XML, dtd_validation=True, events=("start", "end"))
        fast_iter(context, cursor)

        cursor.close()


if __name__ == '__main__':
    main()

EDIT:

I was totally misguided when I wrote this function. I made a huge mistake by overlooking that, while I was trying to skip some unwanted records, they got mixed up with some wanted records. At a certain point in the file, where I skipped nearly a million records in a row, the following wanted record got blown up.

With the help of John and Paul I managed to rewrite my code. It is parsing right now and seems to do it well. I’ll report back if any unexpected errors remain unsolved. Otherwise, thank you all for your help! I really appreciate it!

def fast_iter2(context, cursor):
    elements = set([
        'article', 'inproceedings', 'proceedings', 'book', 'incollection',
        'phdthesis', "mastersthesis", "www"
        ])
    childElements = set(["title", "booktitle", "year", "journal", "ee"])

    paper = {} # represents a paper with all its tags.
    authors = []   # a list of authors who have written the paper "together".
    paperCounter = 0
    for event, element in context:
        tag = element.tag
        if tag in childElements:
            if element.text:
                paper[tag] = element.text
                # print tag, paper[tag]
        elif tag == "author":
            if element.text:
                authors.append(element.text)
                # print "AUTHOR:", authors[-1]
        elif tag in elements:
            paper["element"] = tag
            paper["mdate"] = element.get("mdate")
            paper["dblpkey"] = element.get("key")
            # print tag, element.get("mdate"), element.get("key"), event
            if paper["element"] in ['phdthesis', "mastersthesis", "www"]:
                pass
            else:
                populate_database(paper, authors, cursor)
            paperCounter += 1
            print paperCounter
            paper = {}
            authors = []
            # if paperCounter == 100:
            #     break
            element.clear()
            while element.getprevious() is not None:
                del element.getparent()[0]
    del context
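
The mix-up described in the edit can be reproduced on a tiny in-memory document. The following is only a sketch under stated assumptions: the sample XML, the collect() helper and its reset_on parameter are made up for illustration, and the standard library's xml.etree.ElementTree stands in for lxml so that the snippet is self-contained. When the author list is reset only at the end of wanted records, authors of skipped records leak into the next wanted record.

```python
import io
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)

# Hypothetical miniature of dblp.xml: two unwanted "www" records, then one article.
SAMPLE = b"""<dblp>
  <www key="w1"><author>A</author><title>Home Page</title></www>
  <www key="w2"><author>B</author><title>Home Page</title></www>
  <article key="a1"><author>C</author><title>Real Paper</title></article>
</dblp>"""

WANTED = {"article", "inproceedings", "proceedings", "book", "incollection"}
SKIPPED = {"phdthesis", "mastersthesis", "www"}


def collect(xml_bytes, reset_on):
    """Return {record key: [authors]} for wanted records.

    reset_on is the set of record tags whose "end" event resets the author list.
    """
    result = {}
    authors = []
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes)):  # "end" events only
        if elem.tag == "author" and elem.text:
            authors.append(elem.text)
        elif elem.tag in reset_on:
            if elem.tag in WANTED:
                result[elem.get("key")] = authors
            authors = []
    return result


leaky = collect(SAMPLE, reset_on=WANTED)            # skipped records never reset
fixed = collect(SAMPLE, reset_on=WANTED | SKIPPED)  # every record type resets
print(leaky)
print(fixed)
```

In the leaky variant the article ends up with the authors of both preceding www records as well as its own. With nearly a million skipped records in a row, the same mechanism would produce the inflated author list described in the question.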
1 Answer

  1. Editorial Team
    Added an answer on May 22, 2026 at 11:28 am

    Please eliminate one source of confusion: you haven’t actually said whether the code that you showed does trip over one of your “count of things > 2000” tests. If it doesn’t, then the problem lies in the database update code (which you haven’t shown us).

    If it does so trip over:

    (1) Reduce the limits from 2000 to reasonable values (about 20 for auth and exactly 7 for mydict)

    (2) When the trip happens, print repr(mydict); print; print repr(auth) and analyse the contents in comparison with your file.
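
A minimal sketch of steps (1) and (2), assuming a hypothetical helper named check_limits() that is called once per loop iteration. The default limits are the values suggested above; the message layout is just one possible choice.

```python
def check_limits(mydict, auth, max_fields=7, max_authors=20):
    """Raise ValueError with a dump of both containers when a limit is exceeded."""
    if len(auth) > max_authors or len(mydict) > max_fields:
        raise ValueError(
            "limit exceeded\nmydict = %r\n\nauth = %r" % (mydict, auth)
        )


# Usage with made-up data: 25 authors trips the limit of 20.
try:
    check_limits({"title": "T"}, ["author %d" % i for i in range(25)])
except ValueError as exc:
    report = str(exc)
    print(report)
```

The dumped keys and author names can then be searched for in the XML file to find where the accumulation started.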

    Aside: with iterparse(), elem.text is NOT guaranteed to have been parsed when the “start” event happens. To save some running time, you should access elem.text only when the “end” event happens. In fact, there seems to be no reason why you want “start” events at all. Also you define a list tags but never use it. The start of your function could be written much more concisely:

    def fast_iter(context, cursor):
        mydict = {} # represents a paper with all its tags.
        auth = [] # a list of authors who have written the paper "together".
        counter = 0 # counts the papers
        tagset1 = set(['article', 'inproceedings', 'proceedings', 'book', 'incollection'])
        tagset2 = set(["title", "booktitle", "year", "journal"])
        for event, elem in context:
            tag = elem.tag
            if tag in tagset2:
                if elem.text:
                    mydict[tag] = elem.text
            elif tag == "author":
                if elem.text:
                    auth.append(elem.text)
            elif tag in tagset1:
                counter += 1
                print counter
                mydict["element"] = tag
                mydict["mdate"] = elem.get("mdate")
                mydict["dblpkey"] = elem.get("key")
                #populate_database(mydict, auth, cursor)
                mydict.clear() # Why not just do mydict = {} ??
                auth = []
                # etc etc
    

    Don’t forget to fix the call to iterparse() to remove the events arg.
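
To illustrate what dropping the events argument does, here is a small sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption made so the snippet runs without extra packages), and the one-record sample document is made up. In both libraries, iterparse() reports only "end" events by default.

```python
import io
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)

# Hypothetical one-record document.
SAMPLE = b"<dblp><article key='a1'><title>Real Paper</title></article></dblp>"

# No events argument: each element is reported once, when its closing tag is seen.
seen = [
    (event, elem.tag, elem.text)
    for event, elem in ET.iterparse(io.BytesIO(SAMPLE))
]
print(seen)
```

Each element then shows up exactly once, children before their parent, and its text has been fully parsed by the time it is reported.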

    Also I’m reasonably certain that the elem.clear() should be done only when event is “end” and needs to be done only when tag in tagset1. Read the relevant docs carefully. Doing the cleanup in a “start” event could very well be damaging your tree.
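
A sketch of that cleanup rule, again with the standard library's xml.etree.ElementTree standing in for lxml and with a made-up sample document. The generator name iter_records() is hypothetical. The stdlib elements have no getprevious() or getparent(), so only clear() is shown; with lxml, the sibling-deletion loop from the question would go in the same place.

```python
import io
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)

# Hypothetical two-record document.
SAMPLE = b"""<dblp>
  <article key="a1"><author>C</author><title>First</title></article>
  <book key="b1"><author>D</author><author>E</author><title>Second</title></book>
</dblp>"""

RECORD_TAGS = {"article", "inproceedings", "proceedings", "book", "incollection"}
FIELD_TAGS = {"title", "booktitle", "year", "journal"}


def iter_records(source):
    """Yield (fields, authors) per record and free each record afterwards."""
    fields, authors = {}, []
    for _event, elem in ET.iterparse(source):  # default: "end" events only
        tag = elem.tag
        if tag in FIELD_TAGS:
            if elem.text:
                fields[tag] = elem.text
        elif tag == "author":
            if elem.text:
                authors.append(elem.text)
        elif tag in RECORD_TAGS:
            fields["element"] = tag
            fields["mdate"] = elem.get("mdate")
            fields["dblpkey"] = elem.get("key")
            yield fields, authors
            fields, authors = {}, []
            # Clear only here: the record is complete and has been handed over.
            elem.clear()


records = list(iter_records(io.BytesIO(SAMPLE)))
for fields, authors in records:
    print(fields["dblpkey"], authors)
```

Child elements are never cleared on their own, so nothing is discarded before the enclosing record has been processed.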
