I’m using k-means with matlab on a big and sparse matrix ~(1000000×1000). Now here

Question

0

Asked: May 24, 20262026-05-24T21:54:08+00:00 2026-05-24T21:54:08+00:00

I’m using k-means with matlab on a big and sparse matrix ~(1000000×1000). Now here

0

I’m using k-means with matlab on a big and sparse matrix ~(1000000×1000). Now here is the problem – using cosine similarity as the distance function I get the “Out of memory. Type HELP MEMORY for your options” msg within a few minutes. However, if I use euclidean distance it runs perfectly (same matrix).

This is a bit strange since the distance is computed pairwise and shouldn’t require more than a small constant memory per distance computation.

Cosine works great when using k-means on a smaller matrix (1000×1000, though not as sparse).

Technical details:
The machine is 64 bit with 8GB RAM.
If you want to try: the matrix can be found here (it’s on sendspace, so it’ll be available for a few weeks).

The file is in sparse format: [row]\t[column]\t[value]\n

the matlab code:

f=load(filename);
v=spconvert(f);
c=kmeans(v,9);
c=kmeans(v,9,'distance','cosine');

Any idea regarding the difference in memory usage btw. cosine and euclidean distances?
Any idea as to how to approach it and actually use cosine on a big matrix?

Thanks!

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-24T21:54:08+00:00

If you inspect the kmeans.m function, then the code for the cosine-distance boils down to two critical sections that might throw an out-of-memory errors. First let me introduce the main variables involved:

X: rows are observation vectors, columns are dimensions (data)
C: rows are centroids, columns are dimensions (cluster centroids)

1)

The first piece of code is normalizing the data rows to unit length (this was previously pointed out in @John‘s deleted answer, though for the wrong reasons):

[n,p] = size(X);              %# in your case, X is a matrix of size 1000000x1000

Xnorm = sqrt(sum(X.^2, 2));   %# norm of each instance vector
X = X ./ Xnorm(:,ones(1,p));  %# normalize to unit length

The above tries to vectorize the operation using ONE-indexing to repeat the norm vector by as many columns the data have, then doing an element-wise division. Just check the variable sizes to understand the problem with such approach:

>> whos X Xnorm
  Name             Size                 Bytes  Class     Attributes

  X          1017564x1000            83056640  double    sparse    
  Xnorm      1017564x1               12210776  double    sparse

Thus Xnorm(:,ones(1,p)) will attempt to allocate a temporary matrix of size 12210776*1000 bytes = 11.3722 GB which is clearly what causes the out-of-memory error…

(For those interested, a double sparse matrix X internally needs 12*nnz(X) + 4*size(X,2) bytes for storage, while the full representation takes prod(size(X))*8 bytes. In your case, that’s around 80MB vs 11.5GB of memory needed!)

This line could have been written in a different (probably slower) way, that avoids the huge space requirement that is usually a downside of vectorization. Simply loop over each row and divide by the norm. Even better, we can use the BSXFUN function which was specifically designed for such cases (avoiding the use of REPMAT and indexing tricks):

X = bsxfun(@rdivide, X, Xnorm);

The funny thing is that there are commented sections of code in other places of the KMEANS file, where this issue was clearly considered, and thus opted to use a slower for-loop, yet guaranteed not to run out of memory…

2)

The second critical section is where the actual computation of the distance occurs. The code of interest is the following:

n = size(X,1);
nclusts = size(C,1);

D = zeros(n,nclusts);
for i = 1:nclusts
    D(:,i) = max(1 - X*C(i,:)', 0);
end

Basically it computes the inner-product of each data instance with every cluster centroid (one centroid at a time, against the entire data vectors). Again, in the chance that this causes an issue, you could simply unroll the vectorized product into a step-by-step loop, with something like:

for i = 1:nclusts
    for j = 1:n
        D(j,i) = max(1 - dot(X(j,:),C(i,:)), 0);
    end
end

So you get the idea; When your matrices are really big, you have to be careful about operations that would create large intermediate results, and replace them when possible with an explicit loop which operates on smaller scales.

BTW, you are not experiencing the same problems when using the euclidean distance because it was written with a loop instead of a single-line vectorized solution. Here is the section that computes the distance function:

for i = 1:nclusts
    D(:,i) = (X(:,1) - C(i,1)).^2;
    for j = 2:p
        D(:,i) = D(:,i) + (X(:,j) - C(i,j)).^2;
    end
    % D(:,i) = sum((X - C(repmat(i,n,1),:)).^2, 2);    %# <--- commented code
end

Still, I am surprised that the BSXFUN was again not used instead:

for i=1:nclusts
    D(:,i) = sum(bsxfun(@minus, X, C(i,:)).^2, 2);
end

Please note that I haven’t attempted to cluster the entire data until completion. I am running on a 32-bit machine with 4GB (of which MATLAB can only access 3GB due to architecture restrictions), so please report back whether the proposed changes actually make a difference on your 64-bit/8GB hardware 😉

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I’m using k-means with matlab on a big and sparse matrix ~(1000000×1000). Now here

Leave an answerCancel reply

1 Answer

1)

2)

Leave an answer
Cancel reply