Edit: I need to do this in numpy/scipy, sorry if that wasn’t clear. I have also added requirements.
I need to hold a 50,000×50,000 sparse matrix/2D array, with ~5% of the cells, uniformly distributed, being non-empty. I will need to:
- Read the 5% non-empty data from a DB, and assign it to matrix/2d-array cells, as quickly as possible.
- Use as little memory as possible.
- Use fancy indexing (take the indexes and values of all non-empty cells in a column, say). This is nice-to-have; memory and construction time are more important.
- Once constructed, the matrix will not change.
- I will, however, want to take its transpose, preferably in O(1) memory and time.
What’s the most efficient way of achieving this?
Can I hold NaNs instead of zeros to indicate “empty” cells (0 is a valid value for me), and can I efficiently run nansum and nanmean?
If not, can I efficiently take the indexes and values of all non-zeros in a given column/row?
Well, for my purposes it seems like CSC (scipy.sparse.csc_matrix) is the way to go. Even at a 5% “sparsity factor”, the memory that the row indexes take in CSC is still worth it. Here’s the code I used to test that the stuff I need really is fast:
Running this in %timeit shows that this is indeed fast.
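As a rough sketch of this kind of check (not the exact code timed above), here is one way to build a CSC matrix from (row, col, value) triples, pull out a column's stored entries, and take the transpose. The scaled-down size, the random data standing in for the DB, and the helper name column_entries are all assumptions for illustration.

```python
import numpy as np
from scipy import sparse

# Assumption: a scaled-down 2,000 x 2,000 matrix, with random triples
# standing in for the (row, col, value) records read from the DB.
n = 2_000
nnz = n * n // 20  # 5% of the cells
rng = np.random.default_rng(0)

# Draw unique flat positions so that no (row, col) pair repeats.
flat = rng.choice(n * n, size=nnz, replace=False)
rows, cols = np.divmod(flat, n)
vals = rng.standard_normal(nnz)
vals[:10] = 0.0  # 0 is a valid value: stored explicitly, not treated as "empty"

# Build in COO form from the triples, then convert once to CSC.
A = sparse.coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsc()


def column_entries(mat, j):
    """Return (row indexes, values) of the stored entries in column j of a CSC matrix."""
    start, stop = mat.indptr[j], mat.indptr[j + 1]
    return mat.indices[start:stop], mat.data[start:stop]


rows_j, vals_j = column_entries(A, 7)

# Sum and mean over the stored entries only; this plays the role of
# nansum/nanmean if "not stored" is read as "empty".
col_sum = vals_j.sum()
col_mean = vals_j.mean() if vals_j.size else float("nan")

# The transpose of a CSC matrix comes back in CSR format. It should reuse
# the same underlying arrays instead of copying them; np.shares_memory can
# confirm that on a given SciPy version.
At = A.T

# Back-of-the-envelope memory for the full 50,000 x 50,000 case, assuming
# float64 values (8 bytes) and int32 indexes (4 bytes).
full_n = 50_000
full_nnz = full_n * full_n // 20               # 125,000,000 stored values
est_bytes = full_nnz * (8 + 4) + (full_n + 1) * 4
dense_bytes = full_n * full_n * 8
print(f"CSC estimate: {est_bytes / 1e9:.2f} GB, dense: {dense_bytes / 1e9:.0f} GB")
```

In IPython, `%timeit column_entries(A, 7)` and `%timeit A.T` would be the natural things to time. Explicit zeros stay stored unless eliminate_zeros() is called, which is what makes "stored" usable as "non-empty" here.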