First post, so be as critical as you want…
My problem:
I have a really large file of 140 million lines (file 1) and a smaller file of 3 million lines (file 2). I want to remove the lines in file 1 that match a line in file 2 on the first two fields. Intuitively this seemed like a simple find-and-remove problem that shouldn't take that long, relatively speaking. As my code stands, it's taking ~4 days to run on a 24 GB machine. I'd like to do this for several files, so I'm looking for a way to cut the running time.
Any help and comments would be much appreciated.
Sample file 1:
reftig_0 43 0 1.0
reftig_0 44 1 1.0
reftig_0 45 0 1.0
reftig_0 46 1 1.0
reftig_0 47 0 5.0
Sample file 2:
reftig_0 43
reftig_0 44
reftig_0 45
Code:
data = open('file_1', 'r')
data_2 = open('file_2', 'r')
new_file = open('new_file_1', 'w')
d2 = {}
for line in data_2:
    line = line.rstrip()
    fields = line.split(' ')
    key = (fields[0], fields[1])
    d2[key] = 1
#print d2.keys()
#print d2['reftig_1']
tocheck = d2.keys()
tocheck.sort()
#print tocheck
for sline in data:
    sline = sline.rstrip()
    fields = sline.split(' ')
    nkey = (fields[0], fields[1])
    #print nkey
    if nkey in tocheck:
        pass
    else:
        new_file.write(sline + '\n')
        #print sline
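A note on the membership test in the loop above: in Python 2, which this code appears to target, d2.keys() returns a list, so nkey in tocheck scans that list entry by entry for every line of file 1. Testing against a set (or against d2 itself) is a hash lookup instead. Below is a minimal sketch of that variant, not the original script: the helper names load_keys and filter_lines are made up for illustration, and it uses split() rather than split(' '), which assumes the fields are whitespace-separated.

```python
def load_keys(lines):
    """Collect (name, position) keys from file-2-style lines into a set."""
    keys = set()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            keys.add((fields[0], fields[1]))
    return keys


def filter_lines(lines, keys):
    """Yield only the lines whose first two fields are NOT in keys."""
    for line in lines:
        fields = line.split()
        # Set membership is a hash lookup, not a scan of every key.
        if len(fields) >= 2 and (fields[0], fields[1]) in keys:
            continue
        yield line


# Small in-memory stand-ins for file 2 and file 1 (taken from the samples).
file2 = ["reftig_0 43\n", "reftig_0 44\n", "reftig_0 45\n"]
file1 = [
    "reftig_0 43 0 1.0\n",
    "reftig_0 44 1 1.0\n",
    "reftig_0 45 0 1.0\n",
    "reftig_0 46 1 1.0\n",
    "reftig_0 47 0 5.0\n",
]
kept = list(filter_lines(file1, load_keys(file2)))
```

With real files, the same two functions can be fed open file objects in place of the lists, since both only iterate over lines.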
Writing short strings to new_file once per line is slow. Reduce the number of writes by appending content to a list, and writing to new_file only when the list is, say, 1000 lines long.
Here is the result of running time test.py for various values of N:
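For reference, the kind of batching being timed can be sketched roughly as follows. This is not the original test.py: the function name write_batched is made up for illustration, and n plays the role of N, the number of lines collected before each write. How much this helps in practice will depend on how the output file is already buffered.

```python
import io


def write_batched(lines, out, n=1000):
    """Write lines to out, joining them into one write per n lines."""
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) >= n:
            out.write("".join(batch))
            batch = []
    if batch:
        # Write whatever is left over after the last full batch.
        out.write("".join(batch))


# In-memory stand-in for new_file, so the sketch runs without touching disk.
out = io.StringIO()
write_batched(("line %d\n" % i for i in range(2500)), out, n=1000)
```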