I am using Weka’s SimpleKMeans function to cluster 96000 terms(as word). Weka takes the


Asked: June 14, 2026


I am using Weka’s SimpleKMeans function to cluster 96000 terms (words). Weka takes the desired number of clusters as a parameter, and the default is 2.
My dataset is a sparse 96000×641000 matrix. I started with 10000 clusters, but I think that is too many for the recommendation process.
Is there an algorithm for calculating the number of clusters, or some other way to find the ideal number?


1 Answer

Editorial Team · Answer 1 · 2026-06-14T17:05:38+00:00

K-means is not really designed for sparse data. It is also built around Euclidean distance, and you should be aware that Euclidean distance is not a good choice for high-dimensional data.

Maybe the simplest argument is as follows: the mean of a subset will most likely no longer be sparse, so it is itself anomalous, and it lies closer to the center of the data set than the actual data instances do. This in turn means that the means of different clusters will likely be closer to each other than the actual instances are to their own means, which makes the result highly dubious.
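The argument above can be checked on toy data. The following is only a minimal sketch under stated assumptions: the rows are random sparse vectors invented for illustration (200 rows, 5000 columns, 10 non-zeros each), the two "clusters" are an arbitrary split, and the helper names `random_sparse_rows`, `mean_of_rows` and `sq_dist` are hypothetical, not part of Weka.

```python
import random

def random_sparse_rows(n_rows, n_cols, nnz_per_row, seed=0):
    """Build toy sparse rows as {column: value} dicts (made-up data)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        cols = rng.sample(range(n_cols), nnz_per_row)
        rows.append({c: 1.0 for c in cols})
    return rows

def mean_of_rows(rows):
    """Arithmetic mean of the rows, stored as {column: value}."""
    total = {}
    for row in rows:
        for c, v in row.items():
            total[c] = total.get(c, 0.0) + v
    return {c: v / len(rows) for c, v in total.items()}

def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    keys = set(a) | set(b)
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys)

rows = random_sparse_rows(n_rows=200, n_cols=5000, nnz_per_row=10)
cluster_a, cluster_b = rows[:100], rows[100:]   # arbitrary split, not a real clustering
mean_a, mean_b = mean_of_rows(cluster_a), mean_of_rows(cluster_b)

avg_nnz_rows = sum(len(r) for r in rows) / len(rows)
dist_between_means = sq_dist(mean_a, mean_b)
avg_dist_to_own_mean = sum(sq_dist(r, mean_a) for r in cluster_a) / len(cluster_a)

print("non-zeros per row:", avg_nnz_rows)
print("non-zeros in mean of cluster A:", len(mean_a))
print("squared distance between the two means:", dist_between_means)
print("average squared distance, row to own mean:", avg_dist_to_own_mean)
```

On data like this the mean has far more non-zero entries than any single row, and the two means sit much closer to each other than the rows sit to their own mean. Real term data has structure that random rows lack, so the sketch only illustrates the direction of the effect.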

You should at least try k-medians instead (although it is a lot slower), or other measures that preserve sparsity in the cluster centers, too. Sure, k-means does cluster the data. The question is how valid the result is.
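To see why k-medians keeps the centers sparse, here is a small sketch of the center computation only, not a full clustering loop. It assumes the k-medians center is the coordinate-wise median; the five rows are invented, and `coordinatewise_median` and `coordinatewise_mean` are hypothetical helper names. A column's median is zero whenever more than half of the rows are zero there, so rare terms drop out of the center.

```python
from statistics import median

def coordinatewise_median(rows):
    """Coordinate-wise median of sparse rows given as {column: value} dicts.

    A column missing from a row counts as 0.
    """
    n = len(rows)
    columns = {}
    for row in rows:
        for c, v in row.items():
            columns.setdefault(c, []).append(v)
    center = {}
    for c, values in columns.items():
        padded = values + [0.0] * (n - len(values))  # add the implicit zeros
        m = median(padded)
        if m != 0.0:
            center[c] = m  # store only non-zero coordinates
    return center

def coordinatewise_mean(rows):
    """Coordinate-wise mean, for comparison with the median."""
    total = {}
    for row in rows:
        for c, v in row.items():
            total[c] = total.get(c, 0.0) + v
    return {c: v / len(rows) for c, v in total.items()}

# Made-up cluster: column 0 is shared by most rows, the others are rare.
rows = [
    {0: 1.0, 3: 2.0},
    {0: 1.0, 7: 1.0},
    {0: 2.0, 9: 4.0},
    {5: 1.0},
    {0: 1.0, 3: 1.0},
]

median_center = coordinatewise_median(rows)
mean_center = coordinatewise_mean(rows)

print("median center:", median_center)
print("non-zeros in mean center:", len(mean_center))
```

Here the median center keeps only the one column that most rows share, while the mean center is non-zero in every column that any row touches. As far as I know, Weka's SimpleKMeans documentation says that selecting Manhattan distance makes it compute centroids as component-wise medians; check this against the Weka version you are running.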
