I am comparing two distributions, such as:
group1 = [ 0, 0, 0, 1, 11, 11, 13, 12]
group2 = [ 0, 0, 0, 0, 5, 11, 18, 14]
My distributions don’t have many elements, and I am not sure chi-square is the best approach, but from what I have read it still seems to be the best of the tests I have seen.
The problem is that whichever chi-square implementation I try, I get different results.
If I use:
import numpy as np
import scipy.stats.mstats as mst
mst.chisquare(np.array(group1), np.array(group2))
the answer will be: (8.874603174603175, 0.26178489290758555)
If I use:
import scipy.stats as stat
stat.chisquare(np.array(group1), np.array(group2))
I will get: (nan, nan)
And if I remove all the elements that are 0 in both groups, so that my groups now look like this:
group1 = [ 1, 11, 11, 13, 12]
group2 = [ 0, 5, 11, 18, 14]
using:
mst.chisquare(np.array(group1), np.array(group2))
will give me: (8.874603174603175, 0.06431137995249224)
I am very confused by this ambiguity. What is the true p-value for my distributions?
I guess it is a bug in the scipy.stats.mstats module. mstats is supposed to handle masked arrays (arrays with invalid values) better than stats. However, it seems that in this case it does not count the number of degrees of freedom (DOF) correctly: the chi-square statistic (the first return value of chisquare) is the same before and after removing the zeros, so only the DOF could have changed.
Note that after removing the 0s that appear in both arrays you will still get infinities, because to calculate the chi-square statistic you have to divide by the expected frequencies (group2 in your arrays, see Wikipedia). mstats removes these invalid values, but it does not adapt the DOF accordingly (because there are fewer elements, the DOF should be decreased by the number of removed elements).
I hope this clarifies it a bit. Please consider sending a bug report to the scipy discussion list.
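To make the DOF argument concrete, here is a minimal sketch that does by hand what mstats appears to be doing, but counts the DOF from the bins that are actually used. The helper name chisquare_valid_bins is hypothetical (it is not part of scipy), and dropping every bin whose expected frequency is 0 is an assumption made to mirror the masking, not a recommendation for how to treat such bins.

```python
import numpy as np
from scipy.stats import chi2


def chisquare_valid_bins(f_obs, f_exp):
    """Chi-square test restricted to bins with a non-zero expected frequency.

    Hypothetical helper for illustration only; not a scipy function.
    Returns (statistic, dof, p_value).
    """
    f_obs = np.asarray(f_obs, dtype=float)
    f_exp = np.asarray(f_exp, dtype=float)

    # Bins with f_exp == 0 would give a division by zero (inf or nan),
    # so they are excluded here (assumption: this mirrors the masking).
    valid = f_exp > 0

    terms = (f_obs[valid] - f_exp[valid]) ** 2 / f_exp[valid]
    stat = terms.sum()

    # DOF is based on the number of bins actually used, not the array length.
    # With fewer than two valid bins the DOF is not positive and the
    # p-value is not meaningful.
    dof = int(valid.sum()) - 1
    p_value = chi2.sf(stat, dof)
    return stat, dof, p_value


group1 = [0, 0, 0, 1, 11, 11, 13, 12]
group2 = [0, 0, 0, 0, 5, 11, 18, 14]

stat, dof, p_value = chisquare_valid_bins(group1, group2)
print(stat, dof, p_value)
```

The statistic is the same as the one quoted in the question, but only four bins contribute to it, so the DOF is 3 and the p-value comes out smaller than either value reported above. Calling chi2.sf(stat, 7) or chi2.sf(stat, 4) instead gives the two p-values from the question, which is consistent with the DOF being taken from the full array length. Keep in mind that dropping the bin with an expected frequency of 0 also discards one observation from group1, so treat this as an illustration of the DOF issue rather than a definitive answer.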