I have a text file that lists pairs, for example:
10,1
2,7
3,1
10,1
That has then been turned into a symmetric matrix, so the (1,10) entry is the number of times the pair (1,10) showed up on the list. I would now like to subsample this matrix. By subsample I mean that I would like to make the matrix that would have resulted from using only a random 30% of the lines in the original text file. So in this example, had I erased 70% of the text file, the (1,10) pair might show up only once instead of twice, and so the (1,10) entry in the matrix would be 1 instead of 2.
This can be done easily if I actually have the original text file, by just using random.sample to pick out 30% of the lines in the file. But if I only have the matrix, how can I randomly decimate 70% of the data?
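For reference, the text-file version could look roughly like the sketch below. The helper name `sample_lines`, the `seed` argument, and the in-memory list of lines are assumptions made for illustration; only `random.sample` comes from the description above.

```python
import random

def sample_lines(lines, frac=0.3, seed=None):
    """Keep a random `frac` of the lines, chosen without replacement."""
    rng = random.Random(seed)
    n_keep = round(frac * len(lines))
    return rng.sample(lines, n_keep)

# The four example lines from the question; with frac=0.5 two of them survive.
lines = ["10,1", "2,7", "3,1", "10,1"]
kept = sample_lines(lines, frac=0.5, seed=0)
```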
I guess the best way depends on where your data is large:
Here’s a solution that is suited to the second case, though it will also work OK in the first case.
The fact that the counts happen to be in a 2D matrix is not so
important: this is essentially the problem of sampling from a population that has
been binned. So what we can do is extract the bins directly, and forget about the
matrix for a bit:
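A minimal sketch of that extraction, assuming the matrix is a NumPy integer array called `mat` (a hypothetical name), that labels are used directly as indices, and that a pair (a, a) would be counted once on the diagonal:

```python
import numpy as np

# Hypothetical matrix built from the example pairs 10,1 / 2,7 / 3,1 / 10,1.
# Index 0 is unused so that labels map directly onto indices.
mat = np.zeros((11, 11), dtype=int)
for a, b in [(10, 1), (2, 7), (3, 1), (10, 1)]:
    mat[a, b] += 1
    mat[b, a] += 1

# A symmetric matrix stores every off-diagonal pair twice, so read the
# bins from the upper triangle (diagonal included) only.
rows, cols = np.triu_indices_from(mat)
counts = mat[rows, cols]

# Empty bins carry no information; dropping them keeps the arrays small.
nonzero = counts > 0
rows, cols, counts = rows[nonzero], cols[nonzero], counts[nonzero]
```

Here `counts.sum()` equals the number of lines in the original file, and `rows[i], cols[i]` identifies the pair that bin `i` belongs to.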
And then sample from those bins, using a cumulative array of counts:
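One way this could be written is sketched below. The function name `subsample_counts`, its parameters, and the example matrix are assumptions for illustration. The idea is to number the original lines 0..N-1 bin by bin, pick distinct line numbers at random, and use the cumulative counts to find the bin each picked line belongs to.

```python
import numpy as np

def subsample_counts(mat, frac, rng=None):
    """Return a symmetric matrix as if only `frac` of the original lines were kept."""
    rng = np.random.default_rng() if rng is None else rng

    # Bins: one per unordered pair, read from the upper triangle.
    rows, cols = np.triu_indices_from(mat)
    counts = mat[rows, cols]

    # cum[j] is the number of lines that fall in bins 0..j.
    cum = np.cumsum(counts)
    n_lines = int(cum[-1])
    n_keep = int(round(frac * n_lines))

    # Pick distinct line numbers, then find the first bin whose cumulative
    # count exceeds each line number (empty bins are never selected).
    picked = rng.choice(n_lines, size=n_keep, replace=False)
    bin_idx = np.searchsorted(cum, picked, side="right")
    kept = np.bincount(bin_idx, minlength=len(counts))

    # Write the sampled counts back into both triangles.
    out = np.zeros_like(mat)
    out[rows, cols] = kept
    out[cols, rows] = kept
    return out

# Example with the four pairs from the question, keeping half of the lines.
mat = np.zeros((11, 11), dtype=int)
for a, b in [(10, 1), (2, 7), (3, 1), (10, 1)]:
    mat[a, b] += 1
    mat[b, a] += 1
out_mat = subsample_counts(mat, 0.5, np.random.default_rng(0))
```

Because the line numbers are drawn without replacement, no bin can end up with more counts than it started with, which matches what deleting lines from the file would do.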
Now you will have the sampled matrix in out_mat.