I have a list of sets and some basic statistics for each one (number of items, min, max, mean, stddev). I would like to calculate the same statistics for all of the sets combined. Calculating the total count, min, max, and mean is easy, but I’m unsure how to calculate the total standard deviation.
The data looks like this:
Count Max Min Mean Stddev
1,027,671 781 68 57.8 32.79
839,473 552 54 61.3 48.53
3,012,102 890 41 64.9 41.92
Generating the statistics for all of the sets together:
4,879,246 890 41 62.8 ???
I assume you are writing the code that maintains the distribution, and not just consuming data that already has the standard deviation computed. The standard deviation isn’t a natural quantity for a computer to maintain. Instead, you should maintain the number of items, the sum, and the sum of the squared items; you can then easily compute the mean and standard deviation of the distribution from those three pieces of raw information. I use this strategy in this code: http://github.com/rrenaud/dominionstats/blob/master/stats.py#L17. The add operation supports merging two distributions. Notice how simple its implementation is.
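To make the idea concrete, here is a minimal Python sketch of such an accumulator. It is not the code behind the link above; the class name `RunningStats` and its methods are hypothetical names chosen for illustration. It also includes a `from_summary` helper for the case where only count, mean, and stddev are available, which assumes the reported stddev is the population standard deviation (divide by n, not n - 1).

```python
import math


class RunningStats:
    """Hypothetical accumulator that stores count, sum, and sum of squares."""

    def __init__(self, count=0, total=0.0, total_sq=0.0):
        self.count = count        # number of items seen
        self.total = total        # sum of the items
        self.total_sq = total_sq  # sum of the squared items

    def add(self, x):
        """Record a single observation."""
        self.count += 1
        self.total += x
        self.total_sq += x * x

    def merge(self, other):
        """Combine two distributions: all three fields simply add."""
        return RunningStats(self.count + other.count,
                            self.total + other.total,
                            self.total_sq + other.total_sq)

    @classmethod
    def from_summary(cls, count, mean, stddev):
        """Rebuild the raw sums from (count, mean, stddev).

        Assumption: stddev is the population standard deviation,
        so sum(x^2) = n * (stddev^2 + mean^2).
        """
        return cls(count, count * mean, count * (stddev ** 2 + mean ** 2))

    def mean(self):
        return self.total / self.count

    def stddev(self):
        """Population standard deviation: sqrt(E[x^2] - E[x]^2)."""
        m = self.mean()
        variance = self.total_sq / self.count - m * m
        # Guard against a tiny negative value caused by rounding.
        return math.sqrt(max(variance, 0.0))


# Combine the three sets from the question using only their summaries.
parts = [
    RunningStats.from_summary(1027671, 57.8, 32.79),
    RunningStats.from_summary(839473, 61.3, 48.53),
    RunningStats.from_summary(3012102, 64.9, 41.92),
]
combined = RunningStats()
for part in parts:
    combined = combined.merge(part)
print(combined.count, combined.mean(), combined.stddev())
```

One caveat on this design: when the mean is large relative to the spread, the subtraction in `stddev` involves two nearly equal numbers and can lose precision in floating point. If that matters for your data, an incremental method such as Welford's algorithm is a common alternative.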