I am reading the Weka implementation for re-sampling an array based on a given weight vector. After reading through the code, I am not sure what algorithm underlies this implementation. In addition, I am quite confused about the purpose of these two lines of code:
Utils.normalize(probabilities, sumProbs / sumOfWeights);
and
// Make sure that rounding errors don't mess things up
probabilities[numInstances() - 1] = sumOfWeights;
I do not know what they are used for. The following is the code copied from Weka:
// weka.core.Instances
public Instances resampleWithWeights(Random random, double[] weights)
{
  if (weights.length != numInstances()) {
    throw new IllegalArgumentException("weights.length != numInstances.");
  }
  Instances newData = new Instances(this, numInstances());
  if (numInstances() == 0) {
    return newData;
  }
  double[] probabilities = new double[numInstances()];
  double sumProbs = 0, sumOfWeights = Utils.sum(weights);
  for (int i = 0; i < numInstances(); i++) {
    sumProbs += random.nextDouble();
    probabilities[i] = sumProbs;
  }
  Utils.normalize(probabilities, sumProbs / sumOfWeights);
  // Make sure that rounding errors don't mess things up
  probabilities[numInstances() - 1] = sumOfWeights;
  int k = 0; int l = 0;
  sumProbs = 0;
  while ((k < numInstances() && (l < numInstances()))) {
    if (weights[l] < 0) {
      throw new IllegalArgumentException("Weights have to be positive.");
    }
    sumProbs += weights[l];
    while ((k < numInstances()) &&
           (probabilities[k] <= sumProbs)) {
      newData.add(instance(l));
      newData.instance(k).setWeight(1);
      k++;
    }
    l++;
  }
  return newData;
}
The first code fragment:
Utils.normalize(probabilities, sumProbs / sumOfWeights);
just divides each element of probabilities by the second argument. This converts probabilities from an array whose maximum element is sumProbs to one whose maximum element is sumOfWeights. The second piece of code:
probabilities[numInstances() - 1] = sumOfWeights;
just ensures that the last (maximum) element actually is sumOfWeights and wasn't thrown off by some sort of rounding error.

EDIT: Here's the theory about how the entire method works. The first half (up to the declaration of k and l) generates probabilities as a vector of (not independent) random numbers that are increasing and the last of which is the sum of the weights. This is a random partition of the interval [0, sumOfWeights]. Now the weights themselves are a partition of the same interval. Implicitly, each existing instance is assigned to one element of the weight-based partition.

The second half of the method simply steps along the weights partition (using index l). It samples the l-th instance once for every boundary of the random partition that falls inside the corresponding weight interval. I realize that this explanation is a little awkwardly worded. It may help to picture the interval [0, sumOfWeights] drawn as a line, with the random partition boundaries marked by * and the weight interval boundaries marked by ^. The second half of the method simply counts how many * fall in each weight interval. A little consideration should convince you that this is a valid method of randomly sampling with replacement according to the given weights.
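To make the counting argument concrete, here is a minimal sketch of the same procedure on a plain double[] instead of Weka's Instances. The class name ResampleSketch and the method resampleCounts are made up for illustration and are not part of Weka; the sketch returns how many times each index would be drawn rather than building a new dataset, and it validates the weights up front instead of inside the loop.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical helper for illustration only; not part of Weka.
class ResampleSketch {

    // counts[l] = number of times index l is drawn; the counts sum to weights.length.
    static int[] resampleCounts(double[] weights, Random random) {
        int n = weights.length;
        int[] counts = new int[n];
        if (n == 0) {
            return counts;
        }
        double sumOfWeights = 0;
        for (double w : weights) {
            if (w < 0) {
                throw new IllegalArgumentException("Weights must not be negative.");
            }
            sumOfWeights += w;
        }

        // First half: running sums of uniform draws give increasing boundaries.
        double[] boundaries = new double[n];
        double sumProbs = 0;
        for (int i = 0; i < n; i++) {
            sumProbs += random.nextDouble();
            boundaries[i] = sumProbs;
        }
        // Same effect as Utils.normalize(probabilities, sumProbs / sumOfWeights):
        // divide every element so the largest one becomes sumOfWeights.
        double divisor = sumProbs / sumOfWeights;
        for (int i = 0; i < n; i++) {
            boundaries[i] /= divisor;
        }
        // Guard against rounding error in the last boundary.
        boundaries[n - 1] = sumOfWeights;

        // Second half: walk the cumulative weights and count the boundaries
        // that fall inside each weight interval.
        int k = 0;
        double cumulativeWeight = 0;
        for (int l = 0; l < n && k < n; l++) {
            cumulativeWeight += weights[l];
            while (k < n && boundaries[k] <= cumulativeWeight) {
                counts[l]++;
                k++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] weights = {1.0, 0.0, 3.0};
        int[] counts = resampleCounts(weights, new Random());
        // Index 1 has weight 0, so its interval is empty and it is never drawn.
        System.out.println(Arrays.toString(counts));
    }
}
```

An index with a larger weight owns a wider interval, so on average more boundaries land in it; an index with weight zero owns an empty interval and is never selected.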