I read a large Python array from a CSV file (20332 × 17009) on a 64-bit Windows 7 machine with 12 GB of RAM. About half of the positions in the array hold values, as in the example below. For my analysis I only need the part of the array that has values, not the whole array.
[0 0 0 0 0 0
0 0 0 3 8 0
0 4 2 7 0 0
0 0 5 2 0 0
0 0 1 0 0 0]
I am wondering: is it possible to ignore the 0 values during analysis and save memory?
Thanks in advance!
Given your description, a sparse representation may not be very useful to you. There are many other options, though:
Make sure your values are represented using the smallest data type possible. The example you show above is best represented as single-byte integers. Reading into a NumPy array or a Python array will give you good control over the data type.
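To put rough numbers on this: 20332 × 17009 is about 346 million entries, so 8-byte floats need roughly 2.8 GB while single-byte integers need roughly 0.35 GB. Here is a minimal sketch of choosing the type at load time. The inline CSV text is a made-up stand-in for your file, and it assumes every value fits in the range 0 to 255.

```python
import io
import numpy as np

# Small stand-in for the real CSV file (hypothetical data shaped like the example).
csv_text = """0,0,0,0,0,0
0,0,0,3,8,0
0,4,2,7,0,0
0,0,5,2,0,0
0,0,1,0,0,0
"""

# Default parse: every entry becomes an 8-byte float.
as_float = np.loadtxt(io.StringIO(csv_text), delimiter=",")

# Ask for single-byte unsigned integers instead
# (assumption: all values are whole numbers between 0 and 255).
as_uint8 = np.loadtxt(io.StringIO(csv_text), delimiter=",", dtype=np.uint8)

bytes_float = as_float.nbytes   # 8 bytes per entry
bytes_uint8 = as_uint8.nbytes   # 1 byte per entry
print(bytes_float, bytes_uint8)
```

If your values are larger or negative, the same idea applies with a wider type such as np.int16 or np.int32.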
You can trade memory for performance by only reading a part of the data at a time. If you re-write the entire dataset as binary instead of CSV, then you can use mmap to access the file as if it were already in memory (this would also make it faster to read and write).
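The binary-plus-mmap idea could look roughly like the sketch below. It uses np.memmap, which is NumPy's wrapper around memory-mapped files. The file name, the temporary directory, and the tiny 5 × 6 array are placeholders for illustration only; it also assumes the data fits in uint8.

```python
import os
import tempfile
import numpy as np

rows, cols = 5, 6  # the real data would be 20332 x 17009

# Stand-in data; in practice this would come from the CSV file.
data = np.array(
    [[0, 0, 0, 0, 0, 0],
     [0, 0, 0, 3, 8, 0],
     [0, 4, 2, 7, 0, 0],
     [0, 0, 5, 2, 0, 0],
     [0, 0, 1, 0, 0, 0]],
    dtype=np.uint8,
)

# One-time conversion: dump the raw bytes (row-major order) to a binary file.
path = os.path.join(tempfile.mkdtemp(), "matrix.bin")  # hypothetical location
data.tofile(path)

# Later: map the file instead of loading it. The raw file stores no shape
# or dtype, so both must be supplied again here.
mapped = np.memmap(path, dtype=np.uint8, mode="r", shape=(rows, cols))

row_total = int(mapped[2].sum())     # only touches the bytes of row 2
block = np.array(mapped[1:4, 2:5])   # copy a small window into ordinary RAM
print(row_total)
print(block)
```

For the full-size file, the one-time conversion would itself have to read the CSV a block of rows at a time and append each block to the binary file, since loading everything first defeats the purpose.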
If you really need the entire dataset in memory (and it really doesn’t fit), then some sort of compression may be necessary. Sparse matrices are an option (as larsmans mentioned in the comments, both scipy and pandas have sparse matrix implementations), but these will only help if the fraction of zero-value entries is large. Better compression options will depend on the nature of your data. Consider breaking up the array into chunks and compressing those with a fast compression algorithm like RLE, SZIP, etc.
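To see why a sparse matrix only pays off when most entries are zero, here is a back-of-the-envelope estimate. It assumes a CSR-style layout (one stored value plus one column index per nonzero entry, plus one row pointer per row) with 4-byte indices. The helper functions are written for this illustration and are not part of scipy or pandas; actual library overhead may differ.

```python
def dense_bytes(rows, cols, value_bytes):
    """Memory for a plain dense array."""
    return rows * cols * value_bytes


def csr_bytes(rows, cols, density, value_bytes, index_bytes=4):
    """Rough memory for a CSR-style sparse layout (assumed, see above)."""
    nnz = int(rows * cols * density)          # number of nonzero entries
    values = nnz * value_bytes                # the stored values
    col_indices = nnz * index_bytes           # one column index per value
    row_pointers = (rows + 1) * index_bytes   # where each row starts
    return values + col_indices + row_pointers


rows, cols = 20332, 17009

# Single-byte values, half of the entries nonzero (as described in the question).
dense = dense_bytes(rows, cols, value_bytes=1)
sparse = csr_bytes(rows, cols, density=0.5, value_bytes=1)
print(dense, sparse)
```

Under these assumptions the sparse form takes about 2.5 times as much memory as the dense single-byte array. Ignoring the row pointers, sparse wins only when the nonzero fraction is below value_bytes / (value_bytes + index_bytes), which is 20% for 1-byte values and about 67% for 8-byte values.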