Assume we’re frequently sampling a particular value and want to keep statistics on the samples. The simplest approach is to store every sample so we can calculate whatever stats we want, but this requires unbounded storage. Using a constant amount of storage, we can keep track of some stats like minimum and maximum values. What else can we track using only constant storage? I am thinking of percentiles, standard deviation, and any other useful statistics.
That’s the theoretical question. In my actual situation, the samples are simply millisecond timings: profiling information for a long-running application. There will be millions of samples but not much more than a billion or so. So what stats can be kept for the samples using no more than, say, 10 variables?
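Minimum, maximum, average, total count, and variance are all easy and useful. That’s 5 values. Usually you’ll store the sum rather than the average, and when you need the average you just divide the sum by the count. Likewise, for the variance you store the running sum of squares rather than the variance itself.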
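So, in your loop, update every tracked value for each new sample: compare it against the running minimum and maximum, add it to the sum, add its square to the sum of squares, and increment the count.
Later, you may print any of these stats. Mean and standard deviation can be computed at any time: the mean is the sum divided by the count, and the standard deviation is the square root of (sum of squares / count) minus the square of the mean.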
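As a rough illustration of the bookkeeping described above, here is a minimal Python sketch. The class name `RunningStats` and its field names are made up for this example, and it computes the population standard deviation (dividing by the count, not count minus one).

```python
import math


class RunningStats:
    """Running statistics kept in five variables (hypothetical helper)."""

    def __init__(self):
        self.count = 0
        self.total = 0.0          # running sum of samples
        self.total_sq = 0.0       # running sum of squared samples
        self.min_val = math.inf
        self.max_val = -math.inf

    def add(self, x):
        """Update all five values with one new sample."""
        self.count += 1
        self.total += x
        self.total_sq += x * x
        self.min_val = min(self.min_val, x)
        self.max_val = max(self.max_val, x)

    def mean(self):
        return self.total / self.count

    def std(self):
        """Population standard deviation from the sum and sum of squares."""
        m = self.mean()
        # Clamp at zero: rounding can push the difference slightly negative.
        return math.sqrt(max(self.total_sq / self.count - m * m, 0.0))


stats = RunningStats()
for sample_ms in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.add(sample_ms)
print(stats.mean(), stats.std())
```

One caveat worth keeping in mind: subtracting two large, nearly equal numbers can lose precision when the mean is large compared to the spread. If that turns out to matter for your timings, Welford's online algorithm is a commonly used alternative that also needs only constant storage.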
Median and percentile estimation are more difficult, but possible. The usual trick is to set up a fixed set of histogram bins and increment a bin's counter whenever a sample falls inside it. You can then estimate the median and other percentiles by looking at the distribution of those bin populations. This is only an approximation to the distribution, with accuracy limited by the bin width, but it is often good enough. To find the exact median, you must store all samples.
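The histogram idea could be sketched roughly as follows. Everything here is an assumption made for illustration: the class name `HistogramPercentiles`, the equal-width bins over a range `[lo, hi)` chosen in advance, the clamping of out-of-range samples into the first or last bin, and the linear interpolation inside a bin.

```python
class HistogramPercentiles:
    """Approximate percentiles from fixed, equal-width bins (hypothetical helper)."""

    def __init__(self, lo, hi, num_bins):
        self.lo = lo
        self.hi = hi
        self.width = (hi - lo) / num_bins
        self.bins = [0] * num_bins
        self.count = 0

    def add(self, x):
        """Count the sample in its bin; out-of-range samples go to an end bin."""
        i = int((x - self.lo) / self.width)
        i = min(max(i, 0), len(self.bins) - 1)
        self.bins[i] += 1
        self.count += 1

    def percentile(self, p):
        """Estimate the p-th percentile (0 < p <= 100)."""
        if self.count == 0:
            raise ValueError("no samples recorded")
        target = p / 100 * self.count   # how many samples lie at or below the answer
        seen = 0
        for i, n in enumerate(self.bins):
            if n > 0 and seen + n >= target:
                # Assume samples are spread evenly inside the bin.
                frac = (target - seen) / n
                return self.lo + (i + frac) * self.width
            seen += n
        return self.hi


hist = HistogramPercentiles(lo=0, hi=100, num_bins=10)
for sample_ms in range(100):
    hist.add(sample_ms)
print(hist.percentile(50), hist.percentile(90))
```

The storage is constant but larger than ten variables: one counter per bin. More bins give finer estimates, and the error of any estimate is bounded by roughly one bin width as long as the samples fall inside the chosen range.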