Let’s say that I routinely have to work with files with an unknown, but large, number of lines. Each line contains a set of integers (space, comma, semicolon, or some non-numeric character is the delimiter) in the closed interval [0, R], where R can be arbitrarily large. The number of integers on each line can be variable. Often times I get the same number of integers on each line, but occasionally I have lines with unequal sets of numbers.
Suppose I want to go to Nth line in the file and retrieve the Kth number on that line (and assume that the inputs N and K are valid — that is, I am not worried about bad inputs). How do I go about doing this efficiently in Python 3.1.2 for Windows?
I do not want to traverse the file line by line.
I tried using mmap, but while poking around here on SO, I learned that that’s probably not the best solution on a 32-bit build because of the 4GB limit. And in truth, I couldn’t really figure out how to simply move N lines away from my current position. If I can at least just “jump” to the Nth line then I can use .split() and grab the Kth integer that way.
The nuance here is that I don’t just need to grab one line from the file. I will need to grab several lines: they are not necessarily all near each other, the order in which I get them matters, and the order is not always based on some deterministic function.
Any ideas? I hope this is enough information.
Thanks!
Python’s
seekgoes to a byte offset in a file, not to a line offset, simply because that’s the way modern operating systems and their filesystems work — the OS/FS just don’t record or remember “line offsets” in any way whatsoever, and there’s no way for Python (or any other language) to just magically guess them. Any operation purporting to “go to a line” will inevitably need to “walk through the file” (under the covers) to make the association between line numbers and byte offsets.If you’re OK with that and just want it hidden from your sight, then the solution is the standard library module linecache — but performance won’t be any better than that of code you could write yourself.
If you need to read from the same large file multiple times, a large optimization would be to run once on that large file a script that builds and saves to disk the line number – to – byte offset correspondence (technically an “index” auxiliary file); then, all your successive runs (until the large file changes) could very speedily use the index file to navigate with very high performance through the large file. Is this your use case…?
Edit: since apparently this may apply — here’s the general idea (net of careful testing, error checking, or optimization;-). To make the index, use
makeindex.py, as follows:and then to use it, for example, the following
useindex.py:Here’s an example (on my slow laptop):
The sample file is a plain text file of King James’ Bible:
100K lines, 4.4 MB, as you can see; this takes about a quarter second to index and 50 milliseconds to read and print out three random-y lines (no doubt this can be vastly accelerated with more careful optimization and a better machine). The index in memory (and on disk too) takes 4 bytes per line of the textfile being indexed, and performance should scale in a perfectly linear way, so if you had about 100 million lines, 4.4 GB, I would expect about 4-5 minutes to build the index, a minute to extract and print out three arbitrary lines (and the 400 MB of RAM taken for the index should not inconvenience even a small machine — even my tiny slow laptop has 2GB after all;-).
You can also see that (for speed and convenience) I treat the file as binary (and assume utf8 encoding — works with any subset like ASCII too of course, eg that KJ text file is ASCII) and don’t bother collapsing
\r\ninto a single character if that’s what the file has as line terminator (it’s pretty trivial to do that after reading each line if you want).