(Not strictly programming, but a question that programmers need answered.)
I have a benchmark, X, which is made up of a lot of sub-benchmarks x1..xn. Its quite a noisy test, with the results being quite variable. To accurately benchmark, I must reduce that “variability”, which requires that I first measure the variability.
I can easily calculate the variability of each sub-benchmark, using perhaps standard deviation or variance. However, I’d like to get a single number which represents the overall variability as a single number.
My own attempt at the problem is:
sum = 0
foreach i in 1..n
calculate mean across the 60 runs of x_i
foreach j in 1..60
sum += abs(mean[i] - x_i[j])
variability = sum / 60
Best idea: ask at the statistics Stack Exchange once it hits public beta (in a week).
In the meantime: you might actually be more interested in the extremes of variability, rather than the central tendency (mean, etc.). For many applications, I imagine that there’s relatively little to be gained by incrementing the typical user experience, but much to be gained by improving the worst user experiences. Try the 95th percentile of the standard deviations and work on reducing that. Alternatively, if the typical variability is what you want to reduce, plot the standard deviations all together. If they’re approximately normally distributed, I don’t know of any reason why you couldn’t just take the mean.