Suppose I have the following frequency table.
> print(dat)
  V1    V2
1  1 11613
2  2  6517
3  3  2442
4  4   687
5  5   159
6  6    29
# V1 = Score
# V2 = Frequency
How can I efficiently compute the mean and standard deviation?
The expected result is SD = 0.87 and mean = 1.66.
Replicating the score by frequency takes too long to compute.
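For reference, the replication approach mentioned above might look like the following sketch. The data frame is rebuilt by hand from the printed table, and rep() expands each score by its frequency, so memory and time grow with the total count rather than with the six rows of the table.

```r
# Rebuild the frequency table shown above (V1 = score, V2 = frequency)
dat <- data.frame(V1 = 1:6, V2 = c(11613, 6517, 2442, 687, 159, 29))

# Expand every score V2 times, then use the ordinary mean() and sd()
expanded <- rep(dat$V1, times = dat$V2)

rep_mean <- mean(expanded)
rep_sd   <- sd(expanded)

round(rep_mean, 2)  # 1.66
round(rep_sd, 2)    # 0.87
```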
Mean is easy. SD is a little trickier (can’t just use fastmean() again because there’s an n-1 in the denominator).
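The answer's original function definitions are not reproduced here, so the following is only a minimal sketch of the idea, working directly on the frequency table. The names freq_mean and freq_sd are hypothetical, chosen for illustration. The mean is sum(x * f) / sum(f), and the sample SD divides the frequency-weighted sum of squared deviations by n - 1, where n = sum(f).

```r
# Rebuild the frequency table from the question (V1 = score, V2 = frequency)
dat <- data.frame(V1 = 1:6, V2 = c(11613, 6517, 2442, 687, 159, 29))

# Hypothetical helper: mean of scores weighted by their frequencies
freq_mean <- function(x, freq) {
  sum(x * freq) / sum(freq)
}

# Hypothetical helper: sample SD (n - 1 denominator) from a frequency table
freq_sd <- function(x, freq) {
  n  <- sum(freq)               # total number of observations
  mu <- sum(x * freq) / n       # weighted mean
  sqrt(sum(freq * (x - mu)^2) / (n - 1))
}

round(freq_mean(dat$V1, dat$V2), 2)  # 1.66
round(freq_sd(dat$V1, dat$V2), 2)    # 0.87
```

Both helpers touch only the six rows of the table, so their cost does not depend on how large the frequencies are.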
Note that fastRMSE calculates sum(freq) twice. Eliminating this would probably result in another minor speed boost.
Benchmarking
The replicating solution appears to be O( N^2 ) in the number of entries.
The fastmean solution appears to have a fixed cost of 12 ms or so, after which it scales beautifully.
More benchmarking
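The original timing code and plots are not available here. As a rough sketch of how such a comparison could be rerun, the snippet below scales the frequencies by a hypothetical factor k and times the replication approach against a weighted formula with system.time(). The function names and scale factors are assumptions made for illustration, and the timings will vary by machine.

```r
x    <- 1:6
freq <- c(11613, 6517, 2442, 687, 159, 29)

# Replication approach: expand the table, then call sd()
sd_by_rep <- function(x, f) sd(rep(x, times = f))

# Weighted approach: work on the table directly (n - 1 denominator)
sd_by_freq <- function(x, f) {
  n  <- sum(f)
  mu <- sum(x * f) / n
  sqrt(sum(f * (x - mu)^2) / (n - 1))
}

# Time both approaches as the total number of observations grows
for (k in c(1, 10, 100)) {
  f     <- freq * k
  t_rep <- system.time(sd_by_rep(x, f))["elapsed"]
  t_frq <- system.time(sd_by_freq(x, f))["elapsed"]
  cat(sprintf("k = %3d  rep: %.3fs  freq: %.3fs\n", k, t_rep, t_frq))
}
```

A dedicated package such as microbenchmark would give more stable numbers for the fast variant, whose single-call runtime is too short for system.time() to resolve well.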