I am looking at the k-means++ initialization algorithm. The following two steps of the algorithm give rise to non-uniform probabilities:
For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)^2.
How can I select a point according to this weighted probability distribution in C++?
With a finite set of individual data points X, this calls for a discrete probability distribution.
The easiest way to do this is to enumerate the points X in order and compute an array holding their cumulative distribution function (pseudocode follows):
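The pseudocode itself is not reproduced here, so the following is only a sketch of what the two functions could look like in C++. The names prepare_cdf and select_point come from the answer. The signatures, the input vector d2 (assumed to hold D(x)^2 for each point, in order) and the templated generator parameter are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Build the running sums of the weights. d2[i] is assumed to hold D(x_i)^2.
// The result is an unnormalized CDF: its last entry is the total weight.
std::vector<double> prepare_cdf(const std::vector<double>& d2) {
    std::vector<double> cdf(d2.size());
    double running = 0.0;
    for (std::size_t i = 0; i < d2.size(); ++i) {
        running += d2[i];
        cdf[i] = running;
    }
    return cdf;
}

// Draw the index of one point with probability proportional to its weight.
template <class Rng>
std::size_t select_point(const std::vector<double>& cdf, Rng& rng) {
    // Precondition (assumed): at least one point has a positive weight.
    assert(!cdf.empty() && cdf.back() > 0.0);
    // r is uniform on [0, total weight).
    std::uniform_real_distribution<double> uniform(0.0, cdf.back());
    const double r = uniform(rng);
    // Binary search for the first entry strictly greater than r.
    auto it = std::upper_bound(cdf.begin(), cdf.end(), r);
    if (it == cdf.end()) --it;  // guard in case rounding yields r == cdf.back()
    return static_cast<std::size_t>(it - cdf.begin());
}
```

Because the draw is scaled by the total weight, the array does not need to be normalized to sum to 1. Each selection costs O(log n) thanks to the binary search, after an O(n) setup.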
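You call prepare_cdf once, then call select_point as many times as you need to draw random points from that distribution. Note that in k-means++ the values D(x) change whenever a new center is added, so the table has to be rebuilt before each new center is drawn.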
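As an alternative to a hand-written table, C++11 and later provide std::discrete_distribution in <random>, which takes a range of non-negative weights and returns an index with probability proportional to its weight. The sketch below shows how it might be used here. The helper name select_point_discrete and the vector d2 of squared distances are hypothetical, introduced only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: draw one index with probability proportional to d2[i],
// where d2[i] is assumed to hold D(x_i)^2.
template <class Rng>
std::size_t select_point_discrete(const std::vector<double>& d2, Rng& rng) {
    // Precondition (assumed): weights are non-negative and not all zero.
    assert(!d2.empty());
    // The distribution normalizes the weights internally.
    std::discrete_distribution<std::size_t> dist(d2.begin(), d2.end());
    return dist(rng);
}
```

Constructing the distribution costs O(n) per call. That fits k-means++ reasonably well, since the weights are recomputed before every new center anyway.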