The Archive Base

Editorial Team
Asked: June 5, 2026 (2026-06-05T01:53:56+00:00)

I’m trying to get a handle on multithreading in python. I have working code

I’m trying to get a handle on multithreading in Python. I have working code that counts the words, counts the lines that contain text, and builds a dict with the count of each word. It runs fast on small files like the one noted in the code comments. However, I usually use glob to pull in multiple files, and when I do, run times increase significantly. Because my script is single-threaded, three other cores sit idle while one maxes out.

I thought I would give Python’s threading module a shot. Here’s what I have so far (non-working):

#!/bin/python
#
# test file: http://www.gutenberg.org/ebooks/2852.txt.utf-8

import fileinput
from collections import defaultdict
import threading
import time

inputfilename = 'pg2852.txt'

exitFlag = 0

line = []
line_counter = 0
tot_words = 0
word_dict = defaultdict(int)

def myCounters( threadName, delay):
        for line in fileinput.input([inputfilename]):
                line = line.strip();
                if not line: continue
                words = line.split()
                tot_words += len(words)
                line_counter += 1
                for word in words:
                        word_dict[word] += 1

        print "%s: %s:" %( threadName, time.ctime(time.time()) )
        print word_dict
        print "Total Words: ", tot_words
        print "Total Lines: ", line_counter

try:
        thread.start_new_thread( myCounters, ("Thread-1", 2, ) )
        thread.start_new_thread( myCounters, ("Thread-2", 4, ) )
except:
        print "Error: Thread Not Started"

while 1:
        pass

For those of you who try this code, it doesn’t work. I assume that I need to break the input file into chunks and merge the output somehow. Map/reduce, perhaps? Or is there a simpler solution?

Edit:

Maybe something like:

  1. open the file,
  2. break it into chunks
  3. feed each chunk to a different thread
  4. get counts and build dict on each chunk
  5. merge counts / dict
  6. return results
1 Answer

  1. Editorial Team
    Added an answer on June 5, 2026 at 1:53 am (2026-06-05T01:53:57+00:00)

    First of all, you are correct that you need to use parallel processes rather than parallel threads. This kind of task [see ETA below] will not scale well across multiple threads in Python, because of the Global Interpreter Lock (GIL).

    If you wanted to process a single file in parallel, the obvious way would be to first check the file size, then assign equal-sized chunks to multiple processes. That would just involve telling each process from what position in the file to start, and what position to end. (Of course, you would have to be careful not to count any words or lines twice. A simple approach would be to have each process ignore the initial bytes until it gets to the start of a line, and then start counting).
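The single-file approach described above could be sketched roughly as follows. This is a minimal sketch, assuming Python 3 and UTF-8 text; the names `chunk_ranges`, `count_chunk`, and `count_file_parallel` are hypothetical helpers, not part of the original script. Each worker counts only the lines that start inside its byte range, and it reads past the end of the range to finish its last line, so no line is counted twice or dropped.

```python
import os
from collections import Counter
from multiprocessing import Pool


def chunk_ranges(size, parts):
    """Split [0, size) into contiguous byte ranges of near-equal length."""
    step = max(1, -(-size // parts))  # ceiling division, at least 1 byte
    return [(start, min(start + step, size)) for start in range(0, size, step)]


def count_chunk(task):
    """Count non-blank lines and words for lines that START in [start, end)."""
    path, start, end = task
    line_count = 0
    word_counts = Counter()
    with open(path, "rb") as handle:
        if start > 0:
            # Skip ahead to the first line that begins at or after `start`.
            handle.seek(start - 1)
            handle.readline()
        while handle.tell() < end:
            raw = handle.readline()
            if not raw:
                break  # end of file
            words = raw.decode("utf-8", errors="replace").split()
            if words:
                line_count += 1
                word_counts.update(words)
    return line_count, word_counts


def count_file_parallel(path, parts=4):
    """Count one file using `parts` worker processes, then merge the results."""
    tasks = [(path, s, e) for s, e in chunk_ranges(os.path.getsize(path), parts)]
    total_lines = 0
    merged = Counter()
    with Pool(parts) as pool:
        for lines, counts in pool.map(count_chunk, tasks):
            total_lines += lines
            merged.update(counts)
    return total_lines, sum(merged.values()), merged


if __name__ == "__main__":
    # 'pg2852.txt' is the test file named in the question.
    if os.path.exists("pg2852.txt"):
        lines, words, counts = count_file_parallel("pg2852.txt")
        print("Total Lines:", lines)
        print("Total Words:", words)
```

The file is opened in binary mode so that byte offsets from `seek` and `tell` are exact. Decoding line by line is safe for UTF-8 because a newline byte never appears inside a multi-byte character.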

    However, you state in your question that you will be using a glob to process multiple files. So instead of taking the complex route of chunking files and assigning the chunks to different processes, an easier option is simply assigning different files to different processes.
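One possible shape for the file-per-process route is sketched below, assuming Python 3 and the standard `multiprocessing.Pool`. The function names and the glob pattern are illustrative assumptions, not taken from the original script. Each worker returns its own counts, and the parent process merges them, so no state is shared between processes.

```python
import glob
from collections import Counter
from multiprocessing import Pool


def count_file(path):
    """Return (non-blank lines, total words, per-word counts) for one file."""
    line_count = 0
    word_counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            words = line.split()
            if not words:
                continue  # skip blank lines, as the original loop does
            line_count += 1
            word_counts.update(words)
    return line_count, sum(word_counts.values()), word_counts


def merge_results(results):
    """Combine per-file results into overall totals."""
    total_lines = 0
    total_words = 0
    merged = Counter()
    for lines, words, counts in results:
        total_lines += lines
        total_words += words
        merged.update(counts)
    return total_lines, total_words, merged


def count_files(paths, processes=None):
    """Count each file in a separate worker process and merge the results."""
    with Pool(processes) as pool:
        # Results arrive in completion order; merging does not depend on order.
        return merge_results(pool.imap_unordered(count_file, paths))


if __name__ == "__main__":
    paths = glob.glob("texts/*.txt")  # hypothetical pattern; adjust to your files
    if paths:
        lines, words, counts = count_files(paths)
        print("Total Lines:", lines)
        print("Total Words:", words)
```

With `processes=None`, `Pool` starts one worker per CPU reported by the system. The `if __name__ == "__main__":` guard matters on platforms where worker processes re-import the main module.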


    ETA:

    Using threads in Python is suitable for certain use cases, such as using I/O functions that block for a long time. @uselpa is right that if processing is I/O bound then threads may perform well, but that is not the case here because the bottleneck is actually the parsing, not the file I/O. This is due to the performance characteristics of Python as an interpreted language; in a compiled language, the I/O is more likely to be the bottleneck.

    I make these claims because I have just done some measuring based on the original code (using a test file containing 100 concatenated copies of pg2852.txt):

    • Running as a single thread took about 2.6s to read and parse the file, but only 0.2s when I commented out the parsing code.
    • Running two threads in parallel (reading from the same file) took 7.2s, but two single-threaded processes launched in parallel took only 3.3s to both complete.
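A rough way to repeat the first comparison (reading only versus reading and parsing) is sketched here, assuming Python 3. The helper names are hypothetical. The absolute numbers will differ from those quoted above, since they depend on the machine, the Python version, and the test file.

```python
import time
from collections import Counter


def read_only(path):
    """Read every line without parsing it; return the number of lines read."""
    count = 0
    with open(path, encoding="utf-8", errors="replace") as handle:
        for _ in handle:
            count += 1
    return count


def read_and_parse(path):
    """Read every line and count its words, much as the question's loop does."""
    word_counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            word_counts.update(line.split())
    return word_counts


def timed(func, *args):
    """Run func(*args) once and return (elapsed_seconds, result)."""
    start = time.perf_counter()
    result = func(*args)
    return time.perf_counter() - start, result
```

For example, `timed(read_only, "big.txt")` and `timed(read_and_parse, "big.txt")` give the two durations to compare, where `big.txt` stands in for whatever large test file you use. Note that in Python 3 the read-only figure also includes text decoding.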
© 2021 The Archive Base. All Rights Reserved