The Archive Base

Editorial Team
Asked: May 13, 2026 (2026-05-13T12:51:37+00:00)

I have a large XML data file (>160M) to process, and it seems like

I have a large XML data file (>160M) to process, and it seems like SAX/expat/pulldom parsing is the way to go. I’d like to have a thread that sifts through the nodes and pushes nodes to be processed onto a queue, and then other worker threads pull the next available node off the queue and process it.

I have the following (it should have locks, I know; it will, later):

import sys, time
import xml.parsers.expat
import threading

q = []

def start_handler(name, attrs):
    q.append(name)

def do_expat():
    p = xml.parsers.expat.ParserCreate()
    p.StartElementHandler = start_handler
    p.buffer_text = True
    print("opening {0}".format(sys.argv[1]))
    with open(sys.argv[1]) as f:
        print("file is open")
        p.ParseFile(f)
        print("parsing complete")


t = threading.Thread(group=None, target=do_expat)
t.start()

while True:
    print(q)
    time.sleep(1)

The problem is that the body of the while block runs only once, and after that I can’t even interrupt it with Ctrl-C. On smaller files the output is as expected, but that seems to indicate that the handler only gets called once the document is fully parsed, which defeats the purpose of a SAX parser.

I’m sure it’s my own ignorance, but I don’t see where I’m making the mistake.

PS: I also tried changing start_handler thus:

def start_handler(name, attrs):
    def app():
        q.append(name)
    u = threading.Thread(group=None, target=app)
    u.start()

No love, though.

1 Answer

  1. Editorial Team
    Added an answer on May 13, 2026 at 12:51 pm

    ParseFile, as you’ve noticed, just “gulps down” everything, which is no good for the incremental parsing you want to do. Instead, feed the file to the parser a bit at a time, making sure to yield control to other threads as you go, e.g.:

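    BUFSIZE = 64 * 1024            # chunk size in bytes; an assumed value, tune as needed

    # f is the open file and p the expat parser from the question
    while True:
        data = f.read(BUFSIZE)
        if not data:
            p.Parse('', True)      # final call: tell expat the input is complete
            break
        p.Parse(data, False)       # feed one chunk; more data will follow
        time.sleep(0.0)            # let other threads run between chunks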
    

    The time.sleep(0.0) call is Python’s way of saying “yield to other threads if any are ready and waiting”; the Parse method is documented in the xml.parsers.expat section of the standard library docs.
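
    To make the chunked feeding concrete, here is a minimal, self-contained sketch in Python 3. The helper name parse_in_chunks, the tiny BUFSIZE, and the in-memory sample document are assumptions chosen for illustration, not part of the original question.

```python
import io
import xml.parsers.expat

BUFSIZE = 16  # deliberately tiny so the sample spans several chunks (assumed value)

def parse_in_chunks(stream, on_start, bufsize=BUFSIZE):
    """Feed `stream` to expat chunk by chunk; return the number of chunks fed."""
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    chunks = 0
    while True:
        data = stream.read(bufsize)
        if not data:
            parser.Parse(b"", True)   # final call: the input is complete
            break
        parser.Parse(data, False)     # more data will follow this chunk
        chunks += 1
    return chunks

seen = []  # element names collected by the start-element handler
sample = io.BytesIO(b"<root><a/><b/><c/></root>")
chunk_count = parse_in_chunks(sample, lambda name, attrs: seen.append(name))
print(seen, chunk_count)
```

    With a real file, open it in binary mode (open(path, "rb")) and pass the file object in place of the BytesIO sample; in a threaded program the time.sleep(0.0) call would go right after the non-final Parse call.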

    The second point: forget locks for this usage! Use Queue.Queue (queue.Queue in Python 3) instead; it’s intrinsically thread-safe and almost invariably the best and simplest way to coordinate multiple threads in Python. Just make a Queue instance q, call q.put(name) on it from the handler, and have worker threads block on q.get() waiting for more work to do. It’s that simple!
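
    A rough sketch of that arrangement in Python 3 might look like the following. The names work, results, producer and worker, the three-worker pool, and the upper-casing "processing" step are all placeholders assumed for illustration.

```python
import io
import queue
import threading
import xml.parsers.expat

work = queue.Queue()      # element names waiting to be processed
results = queue.Queue()   # output of the workers, also thread-safe

def producer(stream, bufsize=4096):
    """Parse `stream` in chunks and put every start-tag name on `work`."""
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: work.put(name)
    while True:
        data = stream.read(bufsize)
        if not data:
            parser.Parse(b"", True)
            break
        parser.Parse(data, False)

def worker():
    while True:
        name = work.get()            # blocks until an item is available
        results.put(name.upper())    # stand-in for the real processing
        work.task_done()             # lets work.join() know this item is done

for _ in range(3):
    threading.Thread(target=worker, daemon=True).start()

producer(io.BytesIO(b"<root><a/><b/></root>"))
work.join()                          # wait until every queued item was processed

processed = []
while not results.empty():
    processed.append(results.get())
processed.sort()
print(processed)
```

    Here the producer runs in the main thread for brevity; it could just as well run in its own thread, as in the question.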

    (There are several auxiliary strategies you can use to coordinate the termination of worker threads when there’s no more work for them to do, but the simplest, absent special requirements, is to just make them daemon threads, so they are killed when the program exits, i.e. once the main thread and any other non-daemon threads have finished; see the threading module docs.)
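
    As one possible example of such an auxiliary strategy (not something the answer itself prescribes), non-daemon workers can be shut down cleanly by queueing one sentinel object per worker. STOP, run, and the doubling "work" below are hypothetical names and logic used only for this sketch.

```python
import queue
import threading

STOP = object()  # hypothetical sentinel meaning "no more work"

def worker(q, out):
    while True:
        item = q.get()
        if item is STOP:      # each worker consumes exactly one sentinel, then exits
            break
        out.put(item * 2)     # stand-in for the real processing

def run(items, n_workers=2):
    q, out = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(q, out))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for item in items:
        q.put(item)
    for _ in threads:
        q.put(STOP)           # one sentinel per worker, queued after all real work
    for t in threads:
        t.join()              # returns once every worker has seen its sentinel
    collected = []
    while not out.empty():
        collected.append(out.get())
    return sorted(collected)

result = run([1, 2, 3])
print(result)
```

    The daemon-thread variant needs none of this: passing daemon=True to threading.Thread is enough, at the cost of workers being cut off mid-item when the program exits.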

© 2021 The Archive Base. All Rights Reserved