I’m using k-means with MATLAB on a big, sparse matrix (~1000000×1000). Here is the problem: using cosine similarity as the distance function, I get the “Out of memory. Type HELP MEMORY for your options” message within a few minutes. However, if I use Euclidean distance on the same matrix, it runs perfectly.
This is a bit strange since the distance is computed pairwise and shouldn’t require more than a small constant memory per distance computation.
Cosine works great when using k-means on a smaller matrix (1000×1000, though not as sparse).
Technical details:
The machine is 64 bit with 8GB RAM.
If you want to try: the matrix can be found here (it’s on sendspace, so it’ll be available for a few weeks).
The file is in sparse format: [row]\t[column]\t[value]\n
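The MATLAB code:
f=load(filename);                   % three-column text file: row, column, value
v=spconvert(f);                     % build the sparse matrix from the triplets
c=kmeans(v,9);                      % default (squared Euclidean) distance: runs fine
c=kmeans(v,9,'distance','cosine');  % cosine distance: runs out of memory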
- Any idea regarding the difference in memory usage between the cosine and Euclidean distances?
- Any idea as to how to approach it and actually use cosine on a big matrix?
Thanks!
If you inspect the kmeans.m function, the code for the cosine distance boils down to two critical sections that might throw an out-of-memory error. First let me introduce the main variables involved:
- X: rows are observation vectors, columns are dimensions (data)
- C: rows are centroids, columns are dimensions (cluster centroids)

1)
The first piece of code normalizes the data rows to unit length (this was previously pointed out in @John's deleted answer, though for the wrong reasons):
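As a sketch of the pattern being described (not a verbatim excerpt of kmeans.m), this is what normalizing rows with ONE-indexing looks like on a tiny made-up matrix. The data values and the names Xrep and Xunit are assumptions of mine for illustration:

```matlab
% Tiny made-up sparse matrix standing in for the real data
X = sparse([1 0 2; 0 3 0; 4 0 5; 0 0 6]);
[n, p] = size(X);

Xnorm = sqrt(sum(X.^2, 2));     % n-by-1 vector of row norms
Xrep  = Xnorm(:, ones(1, p));   % ONE-indexing: copies that column p times (n-by-p)
Xunit = X ./ Xrep;              % element-wise division, each row now has unit length
```

On a 4-by-3 matrix the n-by-p copy is harmless; the trouble only shows up when n and p are large.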
The above tries to vectorize the operation by using ONE-indexing to repeat the norm vector across as many columns as the data has, and then does an element-wise division. Just check the variable sizes to understand the problem with such an approach:
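A rough way to see the sizes involved without the real data, again on a tiny made-up matrix. The names Xrep, n_big, p_big and dense_gib are hypothetical, and the large dimensions are just the approximate ones from the question:

```matlab
X = sparse([1 0 2; 0 3 0; 4 0 5; 0 0 6]);
[n, p] = size(X);
Xnorm = sqrt(sum(X.^2, 2));
Xrep  = Xnorm(:, ones(1, p));

% X stores only its nonzeros, but the repeated norm matrix has no zeros at all
fprintf('X:    %d-by-%d, nnz = %d\n', n, p, nnz(X));
fprintf('Xrep: %d-by-%d, nnz = %d\n', size(Xrep, 1), size(Xrep, 2), nnz(Xrep));

% Scale the same temporary up to roughly the question's dimensions
n_big = 1e6;
p_big = 1000;
dense_gib = n_big * p_big * 8 / 2^30;   % 8 bytes per double, about 7.45 GiB
fprintf('at least %.2f GiB for the temporary\n', dense_gib);
```

The 8 bytes per element is a lower bound: a sparse matrix in which every entry is nonzero also has to store an index for each value, so it ends up larger than the equivalent full matrix.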
Thus Xnorm(:,ones(1,p)) will attempt to allocate a temporary matrix of size 12210776*1000 bytes = 11.3722 GB, which is clearly what causes the out-of-memory error…

(For those interested, a double sparse matrix X internally needs 12*nnz(X) + 4*size(X,2) bytes for storage, while the full representation takes prod(size(X))*8 bytes. In your case, that’s around 80MB vs roughly 8GB of memory needed!)

This line could have been written in a different (probably slower) way that avoids the huge space requirement that is usually a downside of vectorization. Simply loop over each row and divide by the norm. Even better, we can use the BSXFUN function, which was specifically designed for such cases (avoiding the use of REPMAT and indexing tricks):
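A minimal sketch of the BSXFUN variant, on made-up data (the matrix values and the name Xunit are mine). It expands the n-by-1 norm vector implicitly instead of materializing an n-by-p copy:

```matlab
X = sparse([1 0 2; 0 3 0; 4 0 5; 0 0 6]);

Xnorm = sqrt(sum(X.^2, 2));            % n-by-1 vector of row norms
Xunit = bsxfun(@rdivide, X, Xnorm);    % divide every row by its own norm
```

This assumes no row of X is all zeros; such a row has norm 0, and dividing by it would produce NaN entries.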
The funny thing is that there are commented sections of code in other places of the KMEANS file where this issue was clearly considered, and where the authors opted for a slower for-loop that is guaranteed not to run out of memory…
2)
The second critical section is where the actual computation of the distance occurs. The code of interest is the following:
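As a sketch of what a per-centroid cosine distance looks like (consistent with the description here, but not a verbatim excerpt of kmeans.m), on made-up data whose rows already have unit length. D, n and nclusts are names I am assuming:

```matlab
% 4 observations, 3 dimensions, 2 centroids; every row has unit length
X = sparse([1 0 0; 0 1 0; 0.6 0.8 0; 0 0 1]);
C = [1 0 0; 0 1 0];
n = size(X, 1);
nclusts = size(C, 1);

D = zeros(n, nclusts);
for i = 1:nclusts
    % X * C(i,:)' is an n-by-1 vector of inner products, so the temporary
    % is one column long regardless of how many columns X has
    D(:, i) = max(1 - X * C(i, :)', 0);
end
```

For unit-length rows the inner product is the cosine similarity, so 1 minus it gives the cosine distance; the max with 0 guards against tiny negative values from round-off.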
Basically it computes the inner product of each data instance with every cluster centroid (one centroid at a time, against all of the data vectors). Again, in case this causes an issue, you could simply unroll the vectorized product into a step-by-step loop, with something like:
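One way the unrolled version might look, sketched on made-up unit-length data (the variable names are assumptions, and this is not code taken from kmeans.m):

```matlab
% 4 observations, 3 dimensions, 2 centroids; every row has unit length
X = sparse([1 0 0; 0 1 0; 0.6 0.8 0; 0 0 1]);
C = [1 0 0; 0 1 0];
n = size(X, 1);
nclusts = size(C, 1);

D = zeros(n, nclusts);
for i = 1:nclusts
    for j = 1:n
        % one observation against one centroid: the product is a single scalar
        D(j, i) = max(1 - full(X(j, :) * C(i, :)'), 0);
    end
end
```

Pulling single rows out of a sparse matrix is slow, so expect this to cost a lot of time in exchange for the small memory footprint. Processing blocks of a few thousand rows per iteration would probably be a reasonable middle ground.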
So you get the idea: when your matrices are really big, you have to be careful about operations that would create large intermediate results, and replace them when possible with an explicit loop that operates on a smaller scale.
BTW, you are not experiencing the same problems when using the euclidean distance because it was written with a loop instead of a single-line vectorized solution. Here is the section that computes the distance function:
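A sketch of that looping style on made-up data, accumulating the squared Euclidean distance one dimension at a time (again written from the description rather than copied from kmeans.m, with assumed variable names):

```matlab
X = sparse([1 0 2; 0 3 0; 4 0 5; 0 0 6]);
C = [1 0 1; 0 1 0];
[n, p] = size(X);
nclusts = size(C, 1);

D = zeros(n, nclusts);
for i = 1:nclusts
    % start with the first dimension, then add the others one column at a time
    D(:, i) = (X(:, 1) - C(i, 1)).^2;
    for j = 2:p
        D(:, i) = D(:, i) + (X(:, j) - C(i, j)).^2;
    end
end
```

Each step only touches one n-by-1 column, so the largest temporary stays at n elements.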
Still, I am surprised that the BSXFUN was again not used instead:
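For comparison, a BSXFUN formulation of the same squared Euclidean distance could look like the sketch below (made-up data, assumed variable names). One caveat worth stating: subtracting a centroid row from a sparse X produces an n-by-p result that is no longer sparse, so this form is probably only suitable when a temporary of that size fits in memory.

```matlab
X = sparse([1 0 2; 0 3 0; 4 0 5; 0 0 6]);
C = [1 0 1; 0 1 0];
n = size(X, 1);
nclusts = size(C, 1);

D = zeros(n, nclusts);
for i = 1:nclusts
    % subtract centroid i from every row, square, and sum across dimensions
    D(:, i) = sum(bsxfun(@minus, X, C(i, :)).^2, 2);
end
```

On this small example it gives the same distances as a column-by-column loop, just with a larger intermediate.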
Please note that I haven’t attempted to cluster the entire data until completion. I am running on a 32-bit machine with 4GB (of which MATLAB can only access 3GB due to architecture restrictions), so please report back whether the proposed changes actually make a difference on your 64-bit/8GB hardware 😉