I grabbed the KDD track1 dataset from Kaggle and decided to load a ~2.5GB 3-column CSV file into memory, on my 16GB high-memory EC2 instance:
data = np.loadtxt('rec_log_train.txt')
The Python session ate up all my memory (100%) and then got killed.
I then read the same file using R (via read.table), and it used less than 5GB of RAM, which dropped to less than 2GB after I called the garbage collector.
My question is: why did this fail under numpy, and what’s the proper way of reading a file into memory? Yes, I can use generators and avoid the problem, but that’s not the goal.
This reads in the 2.5GB file and serializes the output matrix. The input file is read “lazily”, so no intermediate data structures are built and minimal memory is used. The initial load takes a long time, but each subsequent load (of the serialized file) is fast. Please let me know if you have tips!
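For illustration, here is a minimal sketch of the lazy-read-then-serialize idea described above. It is not necessarily the exact code behind the numbers in this post. It assumes the file is whitespace-delimited with 3 numeric columns and no header, and it parses values as float64 to mirror the default of np.loadtxt. The function names load_lazily and load_cached, and the cache file name in the usage note, are hypothetical.

```python
import os
import numpy as np

def load_lazily(path, n_cols=3, delimiter=None):
    """Stream a delimited text file into a 2-D float64 array.

    Assumption: every non-blank line holds exactly n_cols numeric fields.
    Values are yielded one at a time, so no list of rows is ever built.
    """
    def values():
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                fields = line.split(delimiter)
                if len(fields) != n_cols:
                    raise ValueError("expected %d fields, got %d" % (n_cols, len(fields)))
                for field in fields:
                    yield float(field)

    # np.fromiter fills a flat array directly from the generator.
    flat = np.fromiter(values(), dtype=np.float64)
    return flat.reshape(-1, n_cols)

def load_cached(path, cache_path, n_cols=3):
    """Return the array from cache_path if it exists; otherwise parse and save it.

    Assumption: cache_path ends in '.npy', so np.save writes to exactly that name.
    """
    if os.path.exists(cache_path):
        return np.load(cache_path)  # fast path: binary file, no text parsing
    data = load_lazily(path, n_cols=n_cols)
    np.save(cache_path, data)
    return data
```

Usage would look something like data = load_cached('rec_log_train.txt', 'rec_log_train.npy'). One caveat: since no count is passed to np.fromiter, it has to grow its buffer as it reads, so peak memory may briefly be higher than the size of the final array. If the number of rows is known in advance, passing count=n_rows * n_cols should avoid that.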