I am trying to parse a gigantic log file (around 5 GB).
I only want to parse the first 500,000 lines and I don’t want to read the whole file into memory.
Basically, I want to do what the code below is doing, but with a while loop instead of a for loop and an if conditional. I also want to be sure not to read the entire file into memory.
import re
from collections import defaultdict
FILE = open('logs.txt', 'r')
count_words=defaultdict(int)
import pickle
i=0
for line in FILE.readlines():
    if i < 500000:
        m = re.search('key=([^&]*)', line)
        count_words[m.group(1)]+=1
    i+=1
csv=[]
for k, v in count_words.iteritems():
    csv.append(k+","+str(v))
print "\n".join(csv)
Calling readlines() will read the entire file into memory, so you’ll have to read line by line until you reach line 500,000 or hit EOF, whichever comes first. Here’s what you should do instead:
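A minimal sketch of that approach, written for Python 3 (the question's code is Python 2). The names count_keys and max_lines are made up for illustration, and skipping lines that have no key= field is an assumption; the original code would raise an AttributeError on such a line.

```python
import re
from collections import defaultdict

# Same pattern as in the question, compiled once up front.
KEY_PATTERN = re.compile(r'key=([^&]*)')


def count_keys(path, max_lines=500000):
    """Count key= values in at most max_lines lines of the file at path."""
    counts = defaultdict(int)
    lines_read = 0
    with open(path, 'r') as log_file:
        while lines_read < max_lines:
            line = log_file.readline()  # reads a single line, not the whole file
            if not line:  # readline() returns '' only at EOF
                break
            lines_read += 1
            # Drop the trailing newline so it cannot end up inside the key.
            match = KEY_PATTERN.search(line.rstrip('\n'))
            if match:  # assumption: lines without key= are skipped
                counts[match.group(1)] += 1
    return counts


def to_csv(counts):
    """Render the counts as one 'key,count' row per line."""
    return '\n'.join('{},{}'.format(key, count) for key, count in counts.items())
```

Calling print(to_csv(count_keys('logs.txt'))) prints the same kind of output as the original script. The loop stops at whichever comes first, the line limit or EOF, and only one line is held in memory at a time.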