Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 6350917
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 24, 20262026-05-24T21:54:08+00:00 2026-05-24T21:54:08+00:00

I’m using k-means with matlab on a big and sparse matrix ~(1000000×1000). Now here

  • 0

I’m using k-means with matlab on a big and sparse matrix ~(1000000×1000). Now here is the problem – using cosine similarity as the distance function I get the “Out of memory. Type HELP MEMORY for your options” msg within a few minutes. However, if I use euclidean distance it runs perfectly (same matrix).

This is a bit strange since the distance is computed pairwise and shouldn’t require more than a small constant memory per distance computation.

Cosine works great when using k-means on a smaller matrix (1000×1000, though not as sparse).

Technical details:
The machine is 64 bit with 8GB RAM.
If you want to try: the matrix can be found here (it’s on sendspace, so it’ll be available for a few weeks).

The file is in sparse format: [row]\t[column]\t[value]\n

the matlab code:

f=load(filename);
v=spconvert(f);
c=kmeans(v,9);
c=kmeans(v,9,'distance','cosine');
  1. Any idea regarding the difference in memory usage btw. cosine and euclidean distances?

  2. Any idea as to how to approach it and actually use cosine on a big matrix?

Thanks!

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-24T21:54:08+00:00Added an answer on May 24, 2026 at 9:54 pm

    If you inspect the kmeans.m function, then the code for the cosine-distance boils down to two critical sections that might throw an out-of-memory errors. First let me introduce the main variables involved:

    • X: rows are observation vectors, columns are dimensions (data)
    • C: rows are centroids, columns are dimensions (cluster centroids)

    1)

    The first piece of code is normalizing the data rows to unit length (this was previously pointed out in @John‘s deleted answer, though for the wrong reasons):

    [n,p] = size(X);              %# in your case, X is a matrix of size 1000000x1000
    
    Xnorm = sqrt(sum(X.^2, 2));   %# norm of each instance vector
    X = X ./ Xnorm(:,ones(1,p));  %# normalize to unit length
    

    The above tries to vectorize the operation using ONE-indexing to repeat the norm vector by as many columns the data have, then doing an element-wise division. Just check the variable sizes to understand the problem with such approach:

    >> whos X Xnorm
      Name             Size                 Bytes  Class     Attributes
    
      X          1017564x1000            83056640  double    sparse    
      Xnorm      1017564x1               12210776  double    sparse    
    

    Thus Xnorm(:,ones(1,p)) will attempt to allocate a temporary matrix of size 12210776*1000 bytes = 11.3722 GB which is clearly what causes the out-of-memory error…

    (For those interested, a double sparse matrix X internally needs 12*nnz(X) + 4*size(X,2) bytes for storage, while the full representation takes prod(size(X))*8 bytes. In your case, that’s around 80MB vs 11.5GB of memory needed!)

    This line could have been written in a different (probably slower) way, that avoids the huge space requirement that is usually a downside of vectorization. Simply loop over each row and divide by the norm. Even better, we can use the BSXFUN function which was specifically designed for such cases (avoiding the use of REPMAT and indexing tricks):

    X = bsxfun(@rdivide, X, Xnorm);
    

    The funny thing is that there are commented sections of code in other places of the KMEANS file, where this issue was clearly considered, and thus opted to use a slower for-loop, yet guaranteed not to run out of memory…

    2)

    The second critical section is where the actual computation of the distance occurs. The code of interest is the following:

    n = size(X,1);
    nclusts = size(C,1);
    
    D = zeros(n,nclusts);
    for i = 1:nclusts
        D(:,i) = max(1 - X*C(i,:)', 0);
    end
    

    Basically it computes the inner-product of each data instance with every cluster centroid (one centroid at a time, against the entire data vectors). Again, in the chance that this causes an issue, you could simply unroll the vectorized product into a step-by-step loop, with something like:

    for i = 1:nclusts
        for j = 1:n
            D(j,i) = max(1 - dot(X(j,:),C(i,:)), 0);
        end
    end
    

    So you get the idea; When your matrices are really big, you have to be careful about operations that would create large intermediate results, and replace them when possible with an explicit loop which operates on smaller scales.


    BTW, you are not experiencing the same problems when using the euclidean distance because it was written with a loop instead of a single-line vectorized solution. Here is the section that computes the distance function:

    for i = 1:nclusts
        D(:,i) = (X(:,1) - C(i,1)).^2;
        for j = 2:p
            D(:,i) = D(:,i) + (X(:,j) - C(i,j)).^2;
        end
        % D(:,i) = sum((X - C(repmat(i,n,1),:)).^2, 2);    %# <--- commented code
    end
    

    Still, I am surprised that the BSXFUN was again not used instead:

    for i=1:nclusts
        D(:,i) = sum(bsxfun(@minus, X, C(i,:)).^2, 2);
    end
    

    Please note that I haven’t attempted to cluster the entire data until completion. I am running on a 32-bit machine with 4GB (of which MATLAB can only access 3GB due to architecture restrictions), so please report back whether the proposed changes actually make a difference on your 64-bit/8GB hardware 😉

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I'm new to using the Perl treebuilder module for HTML parsing and can't figure
That's pretty much it. I'm using Nokogiri to scrape a web page what has
link Im having trouble converting the html entites into html characters, (&# 8217;) i
I have a jquery bug and I've been looking for hours now, I can't
this is what i have right now Drawing an RSS feed into the php,
I'm using v2.0 of ClassTextile.php, with the following call: $testimonial_text = $textile->TextileRestricted($_POST['testimonial']); ... and
I am currently running into a problem where an element is coming back from
We're building an app, our first using Rails 3, and we're having to build
I'm parsing an RSS feed that has an &#8217; in it. SimpleXML turns this
We are using XSLT to translate a RIXML file to XML. Our RIXML contains

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.