I have a text file that lists pairs, for example 10,1 2,7 3,1 10,1

Question

Asked: June 9, 2026 (2026-06-09T11:47:50+00:00)

I have a text file that lists pairs, for example:

10,1
2,7
3,1
10,1

The list has been turned into a symmetric matrix, so the (1,10) entry is the number of times the pair (1,10) appeared in the list. I would now like to subsample this matrix. By subsample I mean that I would like to produce the matrix that would have resulted from using only a random 30% of the lines in the original text file. In this example, had I erased 70% of the text file, the (1,10) pair might show up only once instead of twice, so the (1,10) entry in the matrix would be 1 instead of 2.

This is easy if I actually have the original text file: I can use random.sample to pick 30% of its lines. But if I only have the matrix, how can I randomly remove 70% of the data?

1 Answer

Editorial Team · Answer 1 · 2026-06-09T11:47:52+00:00

The best approach depends on where your data is large:

Do you have a huge matrix with mostly small counts in it? Or
Do you have a moderately sized matrix with very large counts in it?

Here's a solution suited to the second case, though it also works
fine in the first.

The fact that the counts happen to sit in a 2D matrix is not important:
this is the problem of sampling without replacement from a population that
has been binned. So we can extract the bins directly and forget about the
matrix for a moment:

import numpy as np
import random

# Input counts matrix
mat = np.array([
    [5, 5, 2],
    [1, 1, 3],
    [6, 0, 4]
], dtype=np.int64)

# Build a list of (row,col) pairs, and a list of counts
keys, counts = zip(*[
    ((i,j), mat[i,j])
        for i in range(mat.shape[0])
        for j in range(mat.shape[1])
        if mat[i,j] > 0
])
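
As an aside, the same bins can be extracted without a Python-level loop, which may help when the matrix is large. This is a minimal sketch, assuming `mat` is a NumPy integer array like the one above; `rows`, `cols`, `flat_counts` and `keys_alt` are names introduced here for illustration only:

```python
import numpy as np

mat = np.array([
    [5, 5, 2],
    [1, 1, 3],
    [6, 0, 4]
], dtype=np.int64)

# Indices of the non-empty bins, in row-major order
rows, cols = np.nonzero(mat)

# Counts for those bins, in the same order
flat_counts = mat[rows, cols]

# Equivalent of the `keys` tuple built with the comprehension
keys_alt = list(zip(rows.tolist(), cols.tolist()))
```

The rest of the procedure only needs the bin counts and a way to map each bin back to its (row, col) position, so either form should work.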

And then sample from those bins, using a cumulative array of counts:

# Make the cumulative counts array
counts = np.array(counts, dtype=np.int64)
sum_counts = np.cumsum(counts)

# Decide how many counts to include in the sample
frac_select = 0.30
count_select = int(sum_counts[-1] * frac_select)

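# Choose distinct item indices, i.e. sample without replacement
# (range replaces Python 2's xrange; int() converts the NumPy scalar)
ind_select = sorted(random.sample(range(int(sum_counts[-1])), count_select))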

# A vector to hold the new counts
out_counts = np.zeros(counts.shape, dtype=np.int64)

# Perform basically the merge step of merge-sort, finding where
# the counts land in the cumulative array
i = 0
j = 0
while i<len(sum_counts) and j<len(ind_select):
    if ind_select[j] < sum_counts[i]:
        j += 1
        out_counts[i] += 1
    else:
        i += 1

# Rebuild the matrix using the `keys` list from before
out_mat = np.zeros(mat.shape, dtype=np.int64)
for i in range(len(out_counts)):
    out_mat[keys[i]] = out_counts[i]

Now you will have the sampled matrix in out_mat.
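
One thing the example above does not handle is symmetry: in a symmetric matrix every off-diagonal pair is stored twice, so sampling all entries independently would count each line of the file twice and could produce an asymmetric result. A possible way around this, sketched below, is to sample only the upper triangle and mirror it afterwards. The sketch swaps the manual merge loop for `Generator.multivariate_hypergeometric` (available in recent NumPy versions), which draws without replacement from a binned population. The function name `subsample_symmetric` is hypothetical, and the sketch assumes that a diagonal entry (i,i) counts each (i,i) line exactly once:

```python
import numpy as np

def subsample_symmetric(mat, frac, seed=None):
    """Keep a random `frac` of the pairs behind a symmetric count matrix."""
    rng = np.random.default_rng(seed)

    # Each pair is stored at (i, j) and (j, i), so work on the upper
    # triangle only (diagonal included) to count every pair once.
    iu = np.triu_indices(mat.shape[0])
    counts = mat[iu].astype(np.int64)
    n_keep = int(counts.sum() * frac)

    # Draw n_keep items without replacement from the non-empty bins
    nonzero = counts > 0
    sampled = np.zeros_like(counts)
    sampled[nonzero] = rng.multivariate_hypergeometric(counts[nonzero], n_keep)

    # Put the sampled counts back and mirror them below the diagonal
    out = np.zeros(mat.shape, dtype=np.int64)
    out[iu] = sampled
    return out + np.triu(out, k=1).T

# Small symmetric example: 10 pairs in total, keep half of them
sym = np.array([
    [0, 2, 1],
    [2, 0, 3],
    [1, 3, 4]
], dtype=np.int64)
sub = subsample_symmetric(sym, 0.5, seed=0)
```

Here `sub` is symmetric, its upper triangle sums to 5, and no entry exceeds the corresponding entry of `sym`. Which pairs survive depends on the seed.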
