How do I use scikit-learn to train a model on a large CSV file (~75 MB) without running into memory problems?
I’m using the IPython notebook as the programming environment, and the pandas and sklearn packages to analyze data from Kaggle’s digit recognizer tutorial.
The data is available on the webpage, link to my code, and here is the error message:
KNeighborsClassifier is used for the prediction.
Problem:
A “MemoryError” occurs when loading the large dataset with the read_csv function. To bypass this problem temporarily, I have to restart the kernel, after which read_csv loads the file successfully, but the same error occurs when I run the same cell again.
When the read_csv function loads the file successfully, and after making changes to the dataframe, I can pass the features and labels to the KNeighborsClassifier’s fit() function. At this point, a similar memory error occurs.
I tried the following:
Iterating through the CSV file in chunks and fitting the model on each chunk, but the problem is that the predictive model is overwritten by every new chunk of data.
What do you think I can do to successfully train my model without running into memory problems?
Note: when you load the data with pandas, it will create a DataFrame object where each column has a homogeneous datatype for all the rows, but two columns can have distinct datatypes (e.g. integer, dates, strings).
When you pass a DataFrame instance to a scikit-learn model, it will first allocate a homogeneous 2D numpy array with dtype np.float32 or np.float64 (depending on the implementation of the model). At this point you will have two copies of your dataset in memory.
To avoid this, you could write or reuse a CSV parser that directly allocates the data in the internal format and dtype expected by the scikit-learn model. You can try numpy.loadtxt, for instance (have a look at the docstring for the parameters).
Also, if your data is very sparse (many zero values), it will be better to use a scipy.sparse data structure and a scikit-learn model that can deal with such an input format (check the docstrings to know). However, the CSV format itself is not very well suited for sparse data, and I am not sure there exists a direct CSV-to-scipy.sparse parser.
Edit: for reference, KNeighborsClassifier allocates a temporary distances array with shape (n_samples_predict, n_samples_train), which is very wasteful when only (n_samples_predict, n_neighbors) is needed instead. This issue can be tracked here: https://github.com/scikit-learn/scikit-learn/issues/325
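As a rough illustration of the two points above, here is a minimal sketch that parses the CSV straight into a single float32 array with numpy.loadtxt and then predicts in small batches, so that any temporary array created during prediction only covers one batch of rows at a time. The column layout (a header row, with the label in the first column), the helper names load_labeled_csv and predict_in_batches, and the tiny inline dataset are assumptions made for the example, not something taken from the question's code. Whether float32 is kept without a further copy depends on the model and the scikit-learn version.

```python
import io

import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def load_labeled_csv(source, dtype=np.float32):
    """Parse a CSV (header row, label in the first column: an assumed layout)
    directly into one homogeneous numpy array, skipping the DataFrame step."""
    data = np.loadtxt(source, delimiter=",", skiprows=1, dtype=dtype)
    features = np.ascontiguousarray(data[:, 1:])  # contiguous feature matrix
    labels = data[:, 0].astype(np.int64)          # class labels as integers
    return features, labels


def predict_in_batches(model, X, batch_size=1000):
    """Call model.predict on a slice of rows at a time and join the results,
    so temporary arrays are bounded by batch_size rows instead of all of X."""
    parts = [
        model.predict(X[start:start + batch_size])
        for start in range(0, X.shape[0], batch_size)
    ]
    return np.concatenate(parts)


# Tiny synthetic stand-in for the real file (hypothetical columns).
csv_text = "label,pixel0,pixel1\n0,0,0\n0,1,0\n1,9,9\n1,10,9\n"
X, y = load_labeled_csv(io.StringIO(csv_text))

model = KNeighborsClassifier(n_neighbors=1).fit(X, y)
predictions = predict_in_batches(model, X, batch_size=2)
```

For the real data, the io.StringIO object would be replaced by the path to the CSV file, and batch_size tuned to the available memory. Note that numpy.loadtxt can be slow on large files, so this trades parsing speed for a lower peak memory footprint.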