I’m trying to process data from a CSV file using the csv module in Python. The file has about 50 columns and 401125 rows. I used the following code to read the data into a list:
import csv

csv_file_object = csv.reader(open(r'some_path\Train.csv','rb'))
header = csv_file_object.next()
data = []
for row in csv_file_object:
    data.append(row)
I can get the length of this list with len(data), which returns 401125, and I can access each individual record by its list index.
But when I try to get the size of the list by calling np.size(data) (I imported numpy as np), I get the following stack trace:
MemoryError                               Traceback (most recent call last)
 in ()
----> 1 np.size(data)

C:\Python27\lib\site-packages\numpy\core\fromnumeric.pyc in size(a, axis)
   2198             return a.size
   2199         except AttributeError:
-> 2200             return asarray(a).size
   2201     else:
   2202         try:

C:\Python27\lib\site-packages\numpy\core\numeric.pyc in asarray(a, dtype, order)
    233
    234     """
--> 235     return array(a, dtype, copy=False, order=order)
    236
    237 def asanyarray(a, dtype=None, order=None):

MemoryError:
I can’t even divide the list into multiple parts using list indices, or convert it into a numpy array. Both give the same memory error.
How can I deal with a data sample this big? Is there another way to process large data sets like this one?
I’m using IPython Notebook on Windows 7 Professional.
As noted by @DSM in the comments, the reason you’re getting a memory error is that calling np.size on a list will copy the data into an array first and then get the size.
If you don’t need to work with it as a numpy array, just don’t call np.size. If you do want numpy-like indexing options and so on, you have a few options.
You could use pandas, which is meant for handling big not-necessarily-numerical datasets and has some great helpers and stuff for doing so.
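For the pandas route, here is a minimal sketch. It assumes Python 3 and uses a tiny in-memory sample in place of Train.csv; the column names and values are made up for illustration. With the real file you would pass its path to read_csv instead of `sample`.

```python
import io

import pandas as pd

# Tiny stand-in for Train.csv; these column names and values are hypothetical.
sample = io.StringIO(
    "SalesID,SalePrice,state\n"
    "1,66000,Alabama\n"
    "2,57000,North Carolina\n"
    "3,10000,New York\n"
)

# read_csv parses the file and infers a type for each column.
df = pd.read_csv(sample)

n_rows, n_cols = df.shape  # number of rows and columns
total_cells = df.size      # rows * columns, what np.size(data) was meant to report

# If the whole file does not fit in memory, chunksize makes read_csv
# yield DataFrames of at most that many rows, one at a time.
sample.seek(0)
rows_seen = sum(len(chunk) for chunk in pd.read_csv(sample, chunksize=2))
```

Reading in chunks keeps only one piece of the file in memory at a time, which may help if even the DataFrame is too large.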
If you don’t want to do that, you could define a numpy structured array and populate it line-by-line in the first place rather than making a list and copying into it. Something like:
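A minimal sketch of that approach, assuming Python 3 and a three-column in-memory sample in place of Train.csv. The column names, the types in `fields`, the `converters` list and the 20-character string width are all assumptions for illustration; the real file would need one entry per column and `n_rows = 401125`.

```python
import csv
import io

import numpy as np

# Hypothetical (name, type) pairs; the real file has about 50 columns.
fields = [("SalesID", "i8"), ("SalePrice", "f8"), ("state", "U20")]
# One Python converter per column, in the same order as `fields`.
converters = [int, float, str]

# Tiny stand-in for Train.csv.
sample = io.StringIO(
    "SalesID,SalePrice,state\n"
    "1,66000,Alabama\n"
    "2,57000,North Carolina\n"
    "3,10000,New York\n"
)
n_rows = 3  # must be known up front so the array can be preallocated

reader = csv.reader(sample)
header = next(reader)  # skip the header row

# Allocate the array once, then fill it row by row,
# so no intermediate list of lists is ever built.
data = np.zeros(n_rows, dtype=fields)
for i, row in enumerate(reader):
    data[i] = tuple(conv(value) for conv, value in zip(converters, row))

prices = data["SalePrice"]  # a whole column, selected by field name
first_record = data[0]      # a whole row, selected by index
```

Each row is converted and stored as it is read, so the only large allocation is the array itself.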
You could also define fields based on header so you don’t have to manually type out all 50 column names, though you’d have to do something about specifying the data types for each.
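One possible way to do that, sketched under assumptions: default every column to a fixed-width string and override only the columns you know are numeric. The `overrides` dict, the column names and the "U20" width are hypothetical choices, not something the file dictates.

```python
import csv
import io

import numpy as np

# Tiny stand-in for Train.csv; only the header row matters here.
sample = io.StringIO(
    "SalesID,SalePrice,state\n"
    "1,66000,Alabama\n"
)
header = next(csv.reader(sample))

# Hypothetical: list only the columns known to be numeric.
overrides = {"SalesID": "i8", "SalePrice": "f8"}

# Every other column falls back to a 20-character string.
# Note that numpy truncates longer values to that width on assignment.
fields = [(name, overrides.get(name, "U20")) for name in header]
dtype = np.dtype(fields)
```

The resulting `fields` list can be passed as the dtype when allocating the structured array.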