I am currently reimplementing in Python an algorithm that was originally written in Java. One step is to calculate the standard deviation of a list of values. The original implementation uses DescriptiveStatistics.getStandardDeviation from the Apache Math 1.1 library for this. I use the std function of numpy 1.5. The problem is that they give (very) different results for the same input. The sample I have is this:
[0.113967640255, 0.223095775796, 0.283134228235, 0.416793887842]
I get the following results:
numpy : 0.10932134388775223
Apache Math 1.1 : 0.12620366805397404
Wolfram Alpha : 0.12620366805397404
I checked with Wolfram Alpha to get a third opinion. I do not think that such a difference can be explained by precision alone. Does anyone have any idea why this is happening, and what I could do about it?
Edit: Calculating it manually in Python gives the same result as numpy:
>>> from math import sqrt
>>> v = [0.113967640255, 0.223095775796, 0.283134228235, 0.416793887842]
>>> mu = sum(v) / 4
>>> sqrt(sum([(x - mu)**2 for x in v]) / 4)
0.10932134388775223
Also, in case I am not using it right, this is how I call numpy:
>>> from numpy import std
>>> std([0.113967640255, 0.223095775796, 0.283134228235, 0.416793887842])
0.10932134388775223
Apache and Wolfram divide by N-1 rather than N. This is a degrees-of-freedom adjustment, needed because μ is itself estimated from the sample. Dividing by N-1 gives an unbiased estimate of the population variance, and its square root is the usual sample standard deviation. You can change NumPy's behavior using the ddof option: passing ddof=1 makes it divide by N-1. This is described in the NumPy documentation.
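To see the two conventions side by side, here is a minimal sketch using the sample from the question. The variable names (pop_std, sample_std, manual) are only illustrative, and I have not pinned the exact digits since they may vary slightly with the NumPy version:

```python
import math

import numpy as np

v = [0.113967640255, 0.223095775796, 0.283134228235, 0.416793887842]
n = len(v)

# Default (ddof=0): divide the sum of squared deviations by N.
pop_std = np.std(v)

# ddof=1: divide by N - 1 instead (the sample standard deviation).
sample_std = np.std(v, ddof=1)

# Same N - 1 calculation done by hand, as a cross-check.
mu = sum(v) / n
manual = math.sqrt(sum((x - mu) ** 2 for x in v) / (n - 1))

print(pop_std)     # roughly 0.109
print(sample_std)  # roughly 0.126
print(manual)      # agrees with sample_std up to rounding
```

The two results differ only by the factor sqrt(N / (N - 1)), which is why the gap is so noticeable for a sample of just four values and shrinks as N grows.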