I have to prepend some arbitrary text to an existing but very large (2 to 10 GB) text file. Because the file is so large, I'm trying to avoid reading the whole thing into memory. But am I being too conservative with line-by-line iteration? Would moving to a readlines(sizehint) approach give me much of a performance advantage over my current approach?
The delete-and-move at the end is less than ideal but, as far as I know, there's no way to insert data at the start of a linear file in place. I'm not so well versed in Python, though; maybe there's something unique to Python I can exploit to do this better?
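import os
import shutil


def prependToFile(f, text):
    # generateTempFileName is a helper defined elsewhere (not shown here).
    f_temp = generateTempFileName(f)
    inFile = open(f, 'r')
    outFile = open(f_temp, 'w')

    # Write the new header block first.
    outFile.write('# START\n')
    outFile.write('%s\n' % str(text))
    outFile.write('# END\n\n')

    # Copy the original contents one line at a time.
    for line in inFile:
        outFile.write(line)

    inFile.close()
    outFile.close()

    # Replace the original file with the new one.
    os.remove(f)
    shutil.move(f_temp, f)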
What you want to do is read the file in large blocks (anywhere from 64 KB to several MB) and write each block out. In other words, instead of individual lines, use huge blocks. That way you make as few I/O calls as possible, and hopefully your process ends up I/O-bound instead of CPU-bound.
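As a rough sketch of the block-based copy, something like the following might work. The function name `prepend_to_file`, the 1 MB default for `block_size`, binary mode, and UTF-8 encoding for the header are all assumptions on my part, not anything taken from the question. It uses `tempfile.mkstemp` in place of the question's `generateTempFileName` helper, and `shutil.copyfileobj` to do the chunked read/write loop.

```python
import os
import shutil
import tempfile


def prepend_to_file(path, text, block_size=1024 * 1024):
    """Prepend a header to `path` by copying it through a temp file in large blocks."""
    # Put the temp file next to the original so the final rename
    # stays on the same filesystem (assumption: that directory is writable).
    directory = os.path.dirname(os.path.abspath(path))
    fd, temp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, 'wb') as out_file, open(path, 'rb') as in_file:
            # Same header layout as in the question; UTF-8 is an assumption.
            header = '# START\n%s\n# END\n\n' % text
            out_file.write(header.encode('utf-8'))
            # Copy the original contents block_size bytes at a time,
            # so only one block is held in memory at once.
            shutil.copyfileobj(in_file, out_file, block_size)
        # mkstemp creates the file with restrictive permissions;
        # carry over the original file's mode bits.
        shutil.copymode(path, temp_path)
        # Swap the new file into place, overwriting the original.
        os.replace(temp_path, path)
    except BaseException:
        # Clean up the partial temp file if anything went wrong.
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
```

Usage would look like `prepend_to_file('huge.txt', 'some arbitrary text')`, where `huge.txt` is a placeholder name. Opening both files in binary mode skips newline translation and decoding, which is one less thing for the CPU to do on a multi-gigabyte copy; if the file needs platform-specific line endings in the header, that part would have to be adjusted.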