I grabbed the KDD track1 dataset from Kaggle and decided to load a ~2.5GB

Question

Asked: June 2, 2026

I grabbed the KDD track1 dataset from Kaggle and decided to load a ~2.5GB 3-column CSV file into memory, on my 16GB high-memory EC2 instance:

data = np.loadtxt('rec_log_train.txt')

The Python session ate up all my memory (100%) and then got killed.

I then read the same file using R (via read.table) and it used less than 5GB of RAM, which collapsed to less than 2GB after I called the garbage collector.

My question is: why did this fail under numpy, and what's the proper way of reading a file into memory? Yes, I can use generators and avoid the problem, but that's not the goal.


1 Answer

Editorial Team · Answer 1 · 2026-06-02T14:07:22+00:00

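import re

import numpy as np
import pandas


def load_file(filename, num_cols, delimiter='\t'):
    """Load a delimited numeric text file, caching it as filename + '.npy'."""
    try:
        # Fast path: reuse the binary cache written by an earlier call.
        data = np.load(filename + '.npy')
    except OSError:
        splitter = re.compile(delimiter)

        def items(infile):
            # Yield one float at a time so no intermediate list is built.
            for line in infile:
                for item in splitter.split(line):
                    yield float(item)

        with open(filename, 'r') as infile:
            data = np.fromiter(items(infile), np.float64, -1)
            data = data.reshape((-1, num_cols))
            np.save(filename, data)  # np.save appends '.npy' to the name

    return pandas.DataFrame(data)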

This reads the 2.5GB file and serializes the resulting matrix to disk. The input file is read lazily, so no intermediate data structures are built and memory use stays minimal. The initial load takes a long time, but each subsequent load (of the serialized file) is fast. Please let me know if you have tips!
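To see the lazy-reading idea in isolation, here is a minimal sketch on a tiny stand-in file. The names iter_values, load_matrix, and sample.txt are made up for illustration, and the sketch assumes every line holds exactly num_cols tab-separated numeric fields. It leaves out the .npy caching and the DataFrame wrapper.

```python
import os
import tempfile

import numpy as np


def iter_values(path, delimiter="\t"):
    """Yield one float per field, reading the file a line at a time."""
    with open(path, "r") as infile:
        for line in infile:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            for field in line.split(delimiter):
                yield float(field)


def load_matrix(path, num_cols, delimiter="\t"):
    """Build a (rows, num_cols) float64 array without an intermediate list."""
    flat = np.fromiter(iter_values(path, delimiter), dtype=np.float64)
    return flat.reshape((-1, num_cols))


# Tiny stand-in for the real file: 3 rows, 3 tab-separated columns.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.txt")
with open(sample_path, "w") as out:
    out.write("1\t2\t3\n4\t5\t6\n7\t8\t9\n")

matrix = load_matrix(sample_path, num_cols=3)
print(matrix.shape)   # (3, 3)
print(matrix.nbytes)  # 72 bytes: 9 values * 8 bytes each
```

Because np.fromiter consumes the generator one value at a time, the only large allocation is the output array itself, at 8 bytes per float64 value.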
