I’m trying to read a file in python (scan it lines and look for

Question

0

Asked: May 26, 20262026-05-26T07:08:07+00:00 2026-05-26T07:08:07+00:00

I’m trying to read a file in python (scan it lines and look for

0

I’m trying to read a file in python (scan it lines and look for terms) and write the results- let say, counters for each term. I need to do that for a big amount of files (more than 3000). Is it possible to do that multi threaded? If yes, how?

So, the scenario is like this:

Read each file and scan its lines
Write counters to same output file for all the files I’ve read.

Second question is, does it improve the speed of read/write.

Hope it is clear enough. Thanks,

Ron.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-26T07:08:08+00:00

I agree with @aix, multiprocessing is definitely the way to go. Regardless you will be i/o bound — you can only read so fast, no matter how many parallel processes you have running. But there can easily be some speedup.

Consider the following (input/ is a directory that contains several .txt files from Project Gutenberg).

import os.path
from multiprocessing import Pool
import sys
import time

def process_file(name):
    ''' Process one file: count number of lines and words '''
    linecount=0
    wordcount=0
    with open(name, 'r') as inp:
        for line in inp:
            linecount+=1
            wordcount+=len(line.split(' '))

    return name, linecount, wordcount

def process_files_parallel(arg, dirname, names):
    ''' Process each file in parallel via Poll.map() '''
    pool=Pool()
    results=pool.map(process_file, [os.path.join(dirname, name) for name in names])

def process_files(arg, dirname, names):
    ''' Process each file in via map() '''
    results=map(process_file, [os.path.join(dirname, name) for name in names])

if __name__ == '__main__':
    start=time.time()
    os.path.walk('input/', process_files, None)
    print "process_files()", time.time()-start

    start=time.time()
    os.path.walk('input/', process_files_parallel, None)
    print "process_files_parallel()", time.time()-start

When I run this on my dual core machine there is a noticeable (but not 2x) speedup:

$ python process_files.py
process_files() 1.71218085289
process_files_parallel() 1.28905105591

If the files are small enough to fit in memory, and you have lots of processing to be done that isn’t i/o bound, then you should see even better improvement.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I’m trying to read a file in python (scan it lines and look for

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply