I have to read a file from a particular line number, and I know the line number, say "n".
I have been thinking of two ways:
1.
for i in range(n):
    fname.readline()  # skip the first n lines
k = fname.readline()  # the line that follows them
print(k)
2.
i = 0
for line in fname:
    dictionary[i] = line
    i = i + 1
but I want a faster alternative as I might have to perform this on different files 20000 times.
Are there any better alternatives?
Also, are there any other performance enhancements for simple looping, as my code has nested loops?
If the files aren’t too huge, the linecache module of the standard library is pretty good — it lets you very directly ask for the Nth line of such-and-such file.
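As a minimal sketch of the linecache route: the temporary file and the variable names below are only for illustration, while linecache.getline and linecache.clearcache are the standard-library calls. Note that getline numbers lines from 1 and returns an empty string rather than raising when the line does not exist.

```python
import linecache
import os
import tempfile

# Build a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first\nsecond\nthird\n")
    path = tmp.name

# Lines are numbered from 1, and the trailing newline is kept.
second = linecache.getline(path, 2)

# Asking for a line past the end of the file gives "" instead of an exception.
missing = linecache.getline(path, 99)

# linecache keeps whole files in memory; clear it when looping over many files.
linecache.clearcache()
os.remove(path)
```

When running this over thousands of different files, calling linecache.clearcache() every so often should keep the cache from holding on to files that will not be read again.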
If the files are huge, I recommend something like (warning, untested code):
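One way this could look, as a sketch under stated assumptions rather than a definitive implementation: the helper name nth_line and the 1 MiB default block size are hypothetical, lines are assumed to end in "\n", and line numbers are taken as 1-based.

```python
def nth_line(path, n, blocksize=1024 * 1024):
    """Return line n (1-based) as bytes, or b"" if the file has fewer lines.

    Hypothetical helper: assumes b"\n" line endings (CRLF lines keep their b"\r").
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    newlines_to_skip = n - 1
    with open(path, "rb") as f:
        # Skip whole blocks, counting newlines with the fast bytes.count.
        while newlines_to_skip > 0:
            block = f.read(blocksize)
            if not block:
                return b""  # ran out of file before reaching line n
            found = block.count(b"\n")
            if found < newlines_to_skip:
                newlines_to_skip -= found
                continue
            # Line n starts inside this block: find the last newline to skip.
            pos = -1
            for _ in range(newlines_to_skip):
                pos = block.index(b"\n", pos + 1)
            # Move the file position back to the first byte of line n.
            f.seek(pos + 1 - len(block), 1)
            newlines_to_skip = 0
        # The file position is now at the start of line n, so a single
        # readline handles a line that spans a block boundary.
        return f.readline()
```

Seeking back and letting readline finish the job avoids stitching two blocks together by hand, which is where most of the boundary errors would otherwise creep in.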
The general idea is to read in the file as binary, in large blocks (at least as large as the longest possible line), since the processing (on Windows) from binary to “text” is costly on huge files, and use the fast .count method of strings on most blocks. At the end we can do the line parsing on a single block (two at most in the anomalous case where the line being sought spans block boundaries).
This kind of code requires careful testing and checking (which I haven’t performed in this case), being prone to off-by-one and other boundary errors, so I’d recommend it only for truly huge files: ones that would essentially bust memory if using linecache (which just sucks up the whole file into memory instead of working by blocks). On a typical modern machine with 4 GB of RAM, for example, I’d start thinking about such techniques for text files that are over a GB or two.
Edit: a commenter does not believe that binary reading a file is much faster than the processing required by text mode (on Windows only). To show how wrong this is, let’s use the 'U' (“universal newlines”) option that forces the line-end processing to happen on Unix machines too (as I don’t have a Windows machine to run this on ;-). Using the usual kjv.txt file (4.8 MB, 114 Klines), about 1/1000th of the kind of file sizes I was mentioning earlier:
i.e., just about exactly a factor of 3 cost for the line-end processing (this is on an old-ish laptop, but the ratio should be pretty repeatable elsewhere, too).
Reading by a loop on lines, of course, is even slower:
and using readline as the commenter mentioned (with less efficient buffering than directly looping on the file) is worst:
If, as the question mentions, there are 20,000 files to read (say they’re all small-ish, on the order of this kjv.txt), the fastest approach (reading each file in binary mode in a single gulp) should take about 260 seconds, 4-5 minutes, while the slowest one (based on readline) should take about 1600 seconds, almost half an hour: a pretty significant difference for many, I’d say most, actual applications.
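The four approaches compared above can be timed with a sketch like the following. Several things here are assumptions rather than the original measurements: the function names are made up for illustration, a synthetic file stands in for kjv.txt, and the code targets Python 3, where the 'U' flag no longer exists and plain text mode already applies universal newlines (and additionally decodes bytes to str). No timings are shown because they depend on the machine and Python version.

```python
import os
import tempfile
import timeit


def read_binary(path):
    # One gulp, no newline translation or decoding.
    with open(path, "rb") as f:
        return len(f.read())


def read_text(path):
    # One gulp in text mode: newline translation plus decoding.
    with open(path, "r", encoding="latin-1") as f:
        return len(f.read())


def loop_lines(path):
    # Iterate over the file object line by line.
    count = 0
    with open(path, "r", encoding="latin-1") as f:
        for _ in f:
            count += 1
    return count


def loop_readline(path):
    # Call readline explicitly until it returns an empty string.
    count = 0
    with open(path, "r", encoding="latin-1") as f:
        while f.readline():
            count += 1
    return count


def make_sample(lines=100000):
    # Synthetic stand-in for kjv.txt (an assumption; use a real file if available).
    with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
        for i in range(lines):
            tmp.write(b"In the beginning was line number %d\n" % i)
        return tmp.name


def benchmark(path, repeat=3, number=5):
    # Best-of-repeat seconds per single call, keyed by function name.
    results = {}
    for fn in (read_binary, read_text, loop_lines, loop_readline):
        best = min(timeit.repeat(lambda: fn(path), repeat=repeat, number=number))
        results[fn.__name__] = best / number
    return results


if __name__ == "__main__":
    sample = make_sample()
    try:
        for name, seconds in benchmark(sample).items():
            print("%-14s %.2f ms per file" % (name, seconds * 1000))
    finally:
        os.remove(sample)
```

Multiplying a per-file figure by 20,000 gives the kind of total quoted above: 260 seconds overall corresponds to about 13 ms per file, and 1600 seconds to about 80 ms per file.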