I’ve implemented K-Means in Java and have a bit of a head scratcher. I select my initial centroids by choosing a random value in each dimension within the range of values of the data points. I’ve run into cases where this results in one or more of these centroids not ending up being the closet centroid of any data point. So what do I do for the next iteration? Just leave it at its original randomized value? Pick a new random value? Compute as an average of the other centroids? Seems like this isn’t accounted for in the original algorithm, but probably I’ve just missed something.
Share
Most implementations of k-means define initial centroids using actual data points, not random points in the bounding box drawn by the variables. However, some suggestions for solving your actual problem are below.
You could take another data-point at random and make it a new cluster centroid. This is very simple and fast to implement, and shouldn’t affect the algorithm adversely.
You could also try making a smarter initial selection of cluster centroids using kmeans++. This algorithm chooses the first centroid randomly, and picks the remaining K-1 centroids to try and maximize the inter-centroid distance. By picking smarter centroids, you are much less likely to encounter the problem of a centroid being assigned zero data points.
If you wanted to be slightly more clever clever, you could use the kmeans++ algorithm to make a new centroid whenever a centroid gets assigned zero data points.