I have a large XML data file (>160M) to process, and it seems like SAX/expat/pulldom parsing is the way to go. I’d like to have a thread that sifts through the nodes and pushes nodes to be processed onto a queue, and then other worker threads pull the next available node off the queue and process it.
I have the following (it should have locks, I know; it will, later):
import sys, time
import xml.parsers.expat
import threading

q = []

def start_handler(name, attrs):
    q.append(name)

def do_expat():
    p = xml.parsers.expat.ParserCreate()
    p.StartElementHandler = start_handler
    p.buffer_text = True
    print("opening {0}".format(sys.argv[1]))
    with open(sys.argv[1]) as f:
        print("file is open")
        p.ParseFile(f)
    print("parsing complete")

t = threading.Thread(group=None, target=do_expat)
t.start()
while True:
    print(q)
    time.sleep(1)
The problem is that the body of the while block gets called only once, and then I can’t even ctrl-C interrupt it. On smaller files, the output is as expected, but that seems to indicate that the handler only gets called when the document is fully parsed, which seems to defeat the purpose of a SAX parser.
I’m sure it’s my own ignorance, but I don’t see where I’m making the mistake.
PS: I also tried changing start_handler thus:
def start_handler(name, attrs):
    def app():
        q.append(name)
    u = threading.Thread(group=None, target=app)
    u.start()
No love, though.
ParseFile, as you’ve noticed, just “gulps down” everything, which is no good for the incremental parsing you want to do! So, just feed the file to the parser a bit at a time with the Parse method, making sure to conditionally yield control to other threads as you go. The time.sleep(0.0) call is Python’s way to say “yield to other threads if any are ready and waiting”; the Parse method is documented in the xml.parsers.expat module docs.
The second point is, forget locks for this usage! Use Queue.Queue instead (queue.Queue in Python 3): it’s intrinsically threadsafe and almost invariably the best and simplest way to coordinate multiple threads in Python. Just make a Queue instance q, call q.put(name) on it, and have worker threads block on q.get() waiting to get some more work to do. It’s SO simple!
(There are several auxiliary strategies you can use to coordinate the termination of worker threads when there’s no more work for them to do, but the simplest, absent special requirements, is to just make them daemon threads, so they will all terminate when the main thread does; see the threading docs.)
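To make the two points above concrete, here is a minimal sketch in Python 3 (so the module is queue, not Queue). It is not the original example, just one way to put the pieces together: the chunk size, the function names parse_incrementally, worker and run, and the "processing" step (appending the name to a list) are all assumptions for illustration.

```python
import io
import queue
import threading
import time
import xml.parsers.expat

CHUNK_SIZE = 64 * 1024  # assumed chunk size; tune for your data


def parse_incrementally(stream, work_queue, chunk_size=CHUNK_SIZE):
    """Feed a binary stream to expat chunk by chunk, queueing element names."""
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: work_queue.put(name)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            parser.Parse(b"", True)  # final call: tell expat the input is done
            break
        parser.Parse(chunk, False)
        time.sleep(0.0)  # give other threads a chance to run


def worker(work_queue, results):
    """Hypothetical worker: records each name it pulls off the queue."""
    while True:
        name = work_queue.get()  # blocks until work is available
        results.append(name)     # stand-in for real processing
        work_queue.task_done()


def run(stream, num_workers=2):
    """Start daemon workers, parse the stream, and wait for the queue to drain."""
    work_queue = queue.Queue()
    results = []
    for _ in range(num_workers):
        threading.Thread(
            target=worker, args=(work_queue, results), daemon=True
        ).start()
    parse_incrementally(stream, work_queue)
    work_queue.join()  # wait until every queued item has been processed
    return results
```

For a real file you would open it in binary mode, e.g. with open(sys.argv[1], "rb") as f: run(f). In this sketch the parser runs in the calling thread and only the workers are daemon threads; with more than one worker, the order of the results is not guaranteed.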