Right now I am writing some Python code to deal with massive Twitter files. These files are so big that they can’t fit into memory. To work with them, I basically have two choices.
- I could split the files into smaller files that can fit into memory.
- I could process the big file line by line so I never need to fit the entire file into memory at once.

I would prefer the latter for ease of implementation.
However, I am wondering if it is faster to read an entire file into memory and then manipulate it from there. It seems like it could be slow to keep reading a file line by line from disk. But then again, I do not fully understand how these processes work in Python. Does anyone know if reading a file line by line will make my code slower than reading the entire file into memory and manipulating it from there?
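For illustration, a minimal sketch of the line-by-line option. The function name `count_matching_lines` and the keyword filter are hypothetical placeholders, not the actual processing, and the sketch assumes the file is UTF-8 text with one record per line.

```python
def count_matching_lines(path, keyword):
    """Count lines containing `keyword` (hypothetical stand-in for real work)."""
    count = 0
    # Assumption: UTF-8 text; undecodable bytes are replaced rather than raising.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        # Iterating over the file object yields one line at a time.
        # open() returns a buffered file object by default, so data is read
        # from disk in larger chunks than a single line.
        for line in f:
            if keyword in line:
                count += 1
    return count
```

Only one line is held in memory at a time here, so memory use stays roughly constant regardless of file size.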
For really fast file reading, have a look at the mmap module. This will make the entire file appear as a big chunk of virtual memory, even if it’s much larger than your available RAM. If your file is bigger than 3 or 4 gigabytes, then you’ll want to be using a 64-bit OS (and a 64-bit build of Python), since a 32-bit process does not have enough address space to map a file that large.
I’ve done this for files over 30 GB in size with good results.
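A minimal sketch of what this could look like, not the exact code I used. The function name `count_matching_lines_mmap` and the keyword filter are hypothetical, and the keyword is assumed to be passed as `bytes` because a memory map exposes raw bytes rather than decoded text.

```python
import mmap


def count_matching_lines_mmap(path, keyword):
    """Count lines containing `keyword` (bytes) using a read-only memory map.

    Hypothetical helper for illustration only.
    """
    count = 0
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ makes the map read-only.
        # Note: mapping an empty file raises ValueError.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # mm.readline() returns the next line as bytes, and b"" at the end.
            for line in iter(mm.readline, b""):
                if keyword in line:
                    count += 1
    return count
```

Called as `count_matching_lines_mmap("tweets.txt", b"python")`, where the file name is a placeholder. The operating system pages data in as it is touched, so the whole file is never copied into memory at once.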