I’m trying to get a handle on multithreading in python. I have working code

Question


Asked: June 5, 2026 (2026-06-05T01:53:56+00:00)


I’m trying to get a handle on multithreading in Python. I have working code that counts the words, counts the lines containing text, and builds a dict with the count of each word. It runs fast on small files like the one noted in the code comments. However, I usually use glob to pull in multiple files, and when I do, run times increase significantly. Meanwhile, since my script is single-threaded, I see three other cores sitting idle while one maxes out.

I thought I would give Python’s threading module a shot. Here’s what I have done so far (non-working):

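#!/usr/bin/env python3
#
# test file: http://www.gutenberg.org/ebooks/2852.txt.utf-8

from collections import defaultdict
import threading
import time

inputfilename = 'pg2852.txt'

# Shared state, updated by every thread.
line_counter = 0
tot_words = 0
word_dict = defaultdict(int)

def myCounters(threadName, delay):
    # delay is unused; it is kept only to match the original call signature.
    # The counters are module-level, so they must be declared global here.
    global tot_words, line_counter
    # open() replaces fileinput.input(), which cannot be active twice at once.
    with open(inputfilename, encoding='utf-8') as infile:
        for line in infile:
            line = line.strip()
            if not line:
                continue
            words = line.split()
            tot_words += len(words)
            line_counter += 1
            for word in words:
                word_dict[word] += 1

    print("%s: %s:" % (threadName, time.ctime(time.time())))
    print(word_dict)
    print("Total Words: ", tot_words)
    print("Total Lines: ", line_counter)

# Both threads read the whole file and update the same counters without a
# lock, so the totals are double-counted (and the += updates can race).
threads = [
    threading.Thread(target=myCounters, args=("Thread-1", 2)),
    threading.Thread(target=myCounters, args=("Thread-2", 4)),
]
for t in threads:
    t.start()
# Wait for the threads to finish instead of spinning in an endless loop.
for t in threads:
    t.join()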

For those of you who try this code: it doesn’t work. I assume that I need to break the input file into chunks and merge the output somehow, map/reduce style? Or perhaps there’s a simpler solution?

Edit:

Maybe something like:

open the file,
break it into chunks
feed each chunk to a different thread
get counts and build dict on each chunk
merge counts / dict
return results


1 Answer

Editorial Team · Answer 1 · 2026-06-05T01:53:57+00:00

First of all, you are correct that you need to use parallel processes rather than parallel threads. This kind of task [see ETA below] will not scale well to multiple threads under Python, because the Global Interpreter Lock (GIL) lets only one thread execute Python bytecode at a time.

If you wanted to process a single file in parallel, the obvious way would be to first check the file size, then assign equal-sized chunks to multiple processes. That would just involve telling each process from what position in the file to start, and what position to end. (Of course, you would have to be careful not to count any words or lines twice. A simple approach would be to have each process ignore the initial bytes until it gets to the start of a line, and then start counting).
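
The chunking idea described above could be sketched roughly as follows. This is only an illustration under some assumptions: the helper names `chunk_ranges`, `count_chunk`, and `count_file_in_chunks` are made up here, the file is read in binary mode so that byte offsets are exact, and words are split on ASCII whitespace only. A line belongs to whichever chunk contains its first byte, so no line is counted twice.

```python
import os
import sys
from collections import Counter
from multiprocessing import Pool


def chunk_ranges(path, n_chunks):
    """Split [0, file size) into at most n_chunks contiguous byte ranges."""
    size = os.path.getsize(path)
    step = max(1, -(-size // n_chunks))  # ceiling division
    return [(start, min(start + step, size)) for start in range(0, size, step)]


def count_chunk(path, start, end):
    """Count the lines whose first byte lies in the range [start, end)."""
    lines = 0
    words = Counter()
    with open(path, "rb") as infile:
        if start > 0:
            # Step back one byte and discard through the next newline. If the
            # previous byte is a newline, only that byte is discarded;
            # otherwise the partial line (owned by an earlier chunk) is skipped.
            infile.seek(start - 1)
            infile.readline()
        while infile.tell() < end:
            line = infile.readline()
            if not line:
                break  # end of file
            tokens = line.split()
            if tokens:
                lines += 1
                words.update(t.decode("utf-8", errors="replace") for t in tokens)
    return lines, words


def count_file_in_chunks(path, n_chunks=4):
    """Give each byte range to its own worker process and merge the results."""
    jobs = [(path, start, end) for start, end in chunk_ranges(path, n_chunks)]
    with Pool(n_chunks) as pool:
        results = pool.starmap(count_chunk, jobs)
    total_lines = sum(lines for lines, _ in results)
    total_words = sum((words for _, words in results), Counter())
    return total_lines, total_words


if __name__ == "__main__":
    # Hypothetical usage: python chunkcount.py pg2852.txt
    for path in sys.argv[1:]:
        total_lines, total_words = count_file_in_chunks(path)
        print("Total Words:", sum(total_words.values()))
        print("Total Lines:", total_lines)
```

Each worker may read slightly past its end offset to finish the last line it owns, which is what keeps the ranges from needing to line up with line boundaries.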

However, you state in your question that you will be using a glob to process multiple files. So instead of taking the complex route of chunking files and assigning the chunks to different processes, an easier option is simply assigning different files to different processes.
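
A minimal sketch of the one-file-per-process approach, assuming Python 3 and the standard `multiprocessing` module. The function names `count_file`, `merge_counts`, and `count_files` are placeholders invented for this example, and the glob pattern is whatever you already pass to your script.

```python
import glob
import sys
from collections import Counter
from multiprocessing import Pool


def count_file(path):
    """Count non-blank lines and per-word frequencies in one file."""
    lines = 0
    words = Counter()
    with open(path, encoding="utf-8", errors="replace") as infile:
        for line in infile:
            tokens = line.split()
            if not tokens:
                continue  # skip blank lines, as the original script does
            lines += 1
            words.update(tokens)
    return lines, words


def merge_counts(results):
    """Combine per-file (lines, Counter) pairs into one overall total."""
    total_lines = 0
    total_words = Counter()
    for lines, words in results:
        total_lines += lines
        total_words.update(words)
    return total_lines, total_words


def count_files(paths, processes=None):
    """Count each file in a worker process, then merge in the parent."""
    with Pool(processes) as pool:
        return merge_counts(pool.map(count_file, paths))


if __name__ == "__main__":
    # Hypothetical usage: python wordcount.py 'books/*.txt'
    paths = [p for pattern in sys.argv[1:] for p in glob.glob(pattern)]
    if paths:
        total_lines, total_words = count_files(paths)
        print("Total Words:", sum(total_words.values()))
        print("Total Lines:", total_lines)
```

With `processes=None`, `Pool` starts one worker per available CPU. Each worker returns its own counts, so nothing is shared between processes and the only merging happens once, in the parent.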

ETA:

Using threads in Python is suitable for certain use cases, such as using I/O functions that block for a long time. @uselpa is right that if processing is I/O bound then threads may perform well, but that is not the case here because the bottleneck is actually the parsing, not the file I/O. This is due to the performance characteristics of Python as an interpreted language; in a compiled language, the I/O is more likely to be the bottleneck.

I make these claims because I have just done some measuring based on the original code (using a test file containing 100 concatenated copies of pg2852.txt):

Running as a single thread took about 2.6s to read and parse the file, but only 0.2s when I commented out the parsing code.
Running two threads in parallel (reading from the same file) took 7.2s, but two single-threaded processes launched in parallel took only 3.3s to both complete.
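
If you want to check the read-versus-parse split on your own files, something along these lines should do. It is not the harness used for the numbers above, just a rough sketch; `read_only`, `read_and_parse`, and `timed` are names made up for this example, and the timings will vary by machine.

```python
import sys
import time
from collections import defaultdict


def read_only(path):
    """Iterate over the file without parsing; returns the raw line count."""
    raw_lines = 0
    with open(path, encoding="utf-8", errors="replace") as infile:
        for _ in infile:
            raw_lines += 1
    return raw_lines


def read_and_parse(path):
    """Same loop, plus the splitting and counting from the question's script."""
    line_counter = 0
    tot_words = 0
    word_dict = defaultdict(int)
    with open(path, encoding="utf-8", errors="replace") as infile:
        for line in infile:
            words = line.split()
            if not words:
                continue
            tot_words += len(words)
            line_counter += 1
            for word in words:
                word_dict[word] += 1
    return line_counter, tot_words, word_dict


def timed(func, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical usage: python timing.py bigfile.txt
    for path in sys.argv[1:]:
        _, t_read = timed(read_only, path)
        _, t_parse = timed(read_and_parse, path)
        print("%s: read only %.2fs, read and parse %.2fs" % (path, t_read, t_parse))
```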
