I’m running kmeans on a large dataset and I’m always getting the error below:
Error using kmeans (line 145)
Some points have small relative magnitudes, making them effectively zero.
Either remove those points, or choose a distance other than 'cosine'.
Error in runkmeans (line 7)
[L, C]=kmeans(data, 10, 'Distance', 'cosine', 'EmptyAction', 'drop')
My problem is that even when I add 1 to all the vectors, I still get this error. I would expect it to pass then, but apparently there are still too many zeros (that is what is causing it, right?).
My question is this: what is the condition that makes Matlab decide that a point has “a small relative magnitude” and “is effectively zero”?
I want to remove all these points from my dataset using python, before I hand over the data to Matlab, because I need to compare my results with a gold standard that I process in python.
Thanks in advance!
EDIT-ANSWER
The correct answer was given below, but in case someone finds this question through Google, here’s how you remove the “effectively zero” vectors from your matrix in Python. Every row (!) is a data point, so transpose in Python or Matlab first if your matrix is laid out the other way round before running kmeans:
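import numpy as np

def getxnorm(data):
    # Euclidean norm of each row (one data point per row)
    return np.sqrt(np.sum(data ** 2, axis=1))

def remove_zero_vector(data, startxnorm, excluded=None):
    # startxnorm: row norms of the original, unfiltered matrix
    if excluded is None:  # a mutable default ([]) would be shared between calls
        excluded = []
    eps = 2.2204e-016  # Matlab's eps for doubles
    xnorm = getxnorm(data)
    if np.min(xnorm) <= (eps * np.max(xnorm)):
        local_index = np.transpose(np.where(xnorm == np.min(xnorm)))[0][0]
        # first not-yet-excluded original row with this norm (handles duplicate norms)
        global_index = [i for i in np.where(startxnorm == np.min(xnorm))[0] if i not in excluded][0]
        data = np.delete(data, local_index, 0)  # data with zero vector removed
        excluded.append(global_index)  # add global index to list of excluded vectors
        return remove_zero_vector(data, startxnorm, excluded)
    else:
        return (data, excluded)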
I’m sure there’s a much more scipythonic way for doing this, but it’ll do 🙂
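For what it's worth, the same filtering can probably be done in one pass with a boolean mask instead of recursion, which also avoids hitting Python's recursion limit when there are many such rows. This is only a sketch: the function name drop_effectively_zero_rows is made up for illustration, it assumes the same condition as above (row norm <= eps * largest row norm), and it assumes a non-empty 2-D array with one data point per row.

```python
import numpy as np

def drop_effectively_zero_rows(data, eps=np.finfo(float).eps):
    """Return (kept_rows, excluded_indices) for a 2-D array with one point per row."""
    data = np.asarray(data, dtype=float)
    xnorm = np.sqrt(np.sum(data ** 2, axis=1))  # Euclidean norm of every row
    # Assumed condition: a row counts as "effectively zero" when its norm is
    # at most eps times the largest row norm.
    too_small = xnorm <= eps * np.max(xnorm)
    excluded = np.flatnonzero(too_small)  # indices into the original matrix
    return data[~too_small], excluded.tolist()

# Hypothetical toy data: rows 1 and 3 are exact zero vectors.
points = np.array([[1.0, 2.0], [0.0, 0.0], [3.0, 4.0], [0.0, 0.0]])
kept, excluded = drop_effectively_zero_rows(points)
print(kept.shape, excluded)
```

The excluded indices refer to rows of the original matrix, so they can be used to drop the same points from the gold standard.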
If you’re using this kmeans, then the relevant code that is throwing the error is:
So there’s your test.
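In Python terms, the test can be sketched roughly as below. This assumes the condition is the one used in the question's edit (smallest row norm <= eps times the largest row norm); the function name cosine_check_fails and the toy arrays are made up for illustration, and this is not the Matlab source itself.

```python
import numpy as np

def cosine_check_fails(data, eps=np.finfo(float).eps):
    """True if at least one row would count as 'effectively zero' (assumed test)."""
    data = np.asarray(data, dtype=float)
    xnorm = np.sqrt(np.sum(data ** 2, axis=1))  # one norm per row / data point
    # Only the smallest norm matters: if it passes, every other row passes too.
    return bool(np.min(xnorm) <= eps * np.max(xnorm))

ok = np.array([[1.0, 0.0], [0.5, 0.5]])   # all rows of comparable size
bad = np.array([[1.0, 0.0], [0.0, 0.0]])  # contains an exact zero vector
print(cosine_check_fails(ok), cosine_check_fails(bad))
```

Running this check before handing the matrix to Matlab should tell you whether the cosine distance is going to be rejected.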
As you can see, what’s important is relative size, so adding one to everything only makes things worse (max(Xnorm) is getting larger too). A good fix might be to scale all the data by a constant.
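To make the "relative size" point concrete, here is a small sketch that looks at the ratio of the smallest to the largest row norm, again assuming the test has the form min(Xnorm) <= eps * max(Xnorm). The helper norm_ratio and the example arrays are hypothetical. Note that under this assumed form, multiplying every row by the same constant leaves the ratio unchanged, so it may be worth checking whether a given rescaling actually changes the outcome.

```python
import numpy as np

def norm_ratio(data):
    """Smallest row norm divided by the largest row norm."""
    xnorm = np.linalg.norm(np.asarray(data, dtype=float), axis=1)
    return np.min(xnorm) / np.max(xnorm)

eps = np.finfo(float).eps

# Tiny in absolute terms, but all rows are of similar size.
small = np.array([[1e-10, 0.0], [2e-10, 0.0]])
# The same tiny row sitting next to a huge one.
mixed = np.array([[1e-10, 0.0], [1e10, 0.0]])

print(norm_ratio(small) > eps)  # ratio is 0.5, so this data would pass
print(norm_ratio(mixed) > eps)  # ratio is about 1e-20, so this data would not
# Uniform scaling does not move the ratio (compared with a relative tolerance).
print(np.isclose(norm_ratio(mixed), norm_ratio(1000.0 * mixed), rtol=1e-9, atol=0.0))
```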