first post, be as critical as you want… My problem: I have a really

Question

Asked: June 13, 2026 (2026-06-13T02:51:22+00:00)

First post, so be as critical as you want…

My problem:
I have a really large file of 140 million lines (file 1) and a smaller file of 3 million lines (file 2). I want to remove the lines in file 1 that have a match in file 2. Intuitively this seemed like a simple find-and-remove problem that shouldn't take that long, relatively speaking. As my code stands, it's taking ~4 days to run on a machine with 24 GB of memory. I'd like to do this for several files, so I would like to improve the run time.
Any help and comments would be much appreciated.

Sample file 1:

reftig_0 43 0 1.0
reftig_0 44 1 1.0
reftig_0 45 0 1.0
reftig_0 46 1 1.0
reftig_0 47 0 5.0

Sample file 2:

reftig_0 43
reftig_0 44
reftig_0 45

Code:

data = open('file_1', 'r')
data_2 = open('file_2', 'r')
new_file = open('new_file_1', 'w')

d2= {}
for line in data_2:
    line= line.rstrip()
    fields = line.split(' ')
    key = (fields[0], fields[1])
    d2[key]=1

#print d2.keys()
#print d2['reftig_1']
tocheck=d2.keys()
tocheck.sort()
#print tocheck

for sline in data:
    sline = sline.rstrip()
    fields = sline.split(' ')
    nkey = (fields[0],fields[1])
    #print nkey
    if nkey in tocheck:
        pass
    else:
        new_file.write(sline + '\n')
        #print sline

1 Answer

Editorial Team · Answer 1 · 2026-06-13T02:51:23+00:00

Writing short strings to new_file once per line is slow. Reduce the number of writes by appending the lines to a list and writing to new_file only when the list reaches, say, 1000 lines. Any lines still in the list when the loop ends need one last write.

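N = 1000
with open('/tmp/out', 'w') as f:
    result = []
    for x in range(10**7):
        result.append('Hi\n')
        if len(result) >= N:
            # Flush the buffered lines with a single write call.
            f.write(''.join(result))
            result = []
    # Write whatever is left when the line count is not a multiple of N.
    if result:
        f.write(''.join(result))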

Here is the result of running time test.py for various values of N:

|      N | time (sec) |
|      1 |      5.879 |
|     10 |      2.781 |
|    100 |      2.417 |
|   1000 |      2.325 |
|  10000 |      2.299 |
| 100000 |      2.309 |
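
Applied to the filtering loop in the question, the same batching might look like the sketch below. The function name filter_lines and the batch_size parameter are made up for illustration, and the code is written for Python 3. It makes one change that is separate from the batching advice: keys are tested against a set instead of the sorted key list, because `in` on a list scans it element by element while a set lookup is a hash lookup. It also splits on any whitespace rather than on a single space, which is an assumption about the input format.

```python
def filter_lines(big_path, small_path, out_path, batch_size=1000):
    """Copy big_path to out_path, skipping lines whose first two
    whitespace-separated fields also appear in small_path.
    Returns the number of lines written."""
    # Collect the keys to drop in a set (hash lookup per membership test).
    drop = set()
    with open(small_path) as small:
        for line in small:
            fields = line.split()
            if len(fields) >= 2:
                drop.add((fields[0], fields[1]))

    written = 0
    buffered = []
    with open(big_path) as big, open(out_path, 'w') as out:
        for line in big:
            fields = line.split()
            if len(fields) >= 2 and (fields[0], fields[1]) in drop:
                continue  # matched a key from the small file: skip it
            buffered.append(line.rstrip('\n') + '\n')
            written += 1
            if len(buffered) >= batch_size:
                # One write call for the whole batch.
                out.write(''.join(buffered))
                buffered = []
        # Flush the final, possibly partial, batch.
        if buffered:
            out.write(''.join(buffered))
    return written
```

With the file names from the question, the call would be filter_lines('file_1', 'file_2', 'new_file_1'). The 3 million keys have to fit in memory as a set, which the question's dictionary already required.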
