Using DataFrame (pandas as pd, numpy as np):
test = pd.DataFrame({'A': [10, 11, 12, 13, 15, 25, 43, 70],
                     'B': [1, 2, 3, 4, 5, 6, 7, 8],
                     'C': [1, 1, 1, 1, 2, 2, 2, 2]})
In [39]: test
Out[39]:
A B C
0 10 1 1
1 11 2 1
2 12 3 1
3 13 4 1
4 15 5 2
5 25 6 2
6 43 7 2
7 70 8 2
Grouping DF by ‘C’ and aggregating with np.mean (also sum, min, max) produces column-wise aggregation within groups:
In [40]: test_g = test.groupby('C')
In [41]: test_g.aggregate(np.mean)
Out[41]:
A B
C
1 11.50 2.5
2 38.25 6.5
However, it looks like aggregating using np.median produces DataFrame-wise aggregation within groups:
In [42]: test_g.aggregate(np.median)
Out[42]:
A B
C
1 7.0 7.0
2 11.5 11.5
(using groupby.median method seems to produce expected column-wise results though)
I would appreciate answers to the following questions:
- What is the reason/mechanism of such an outcome?
- If this behaviour is confirmed, how does it affect recommended “best practices” of aggregating groupings? Could other aggregation functions work this way?
The reason is quite funny. Probably some pandas specialists would want to chime in, but it comes down to a ping-pong between numpy and pandas. Note what the documentation says about the function you pass in: it must work either when passed a DataFrame or when passed to DataFrame.apply.
The first case is a 2D array-like (the whole group); the second comes down to 1D array-likes (one column at a time) being passed to the function you give.
This means aggregate first passes the whole 2D group in. In the first case (np.mean), numpy sees that the object has a .mean method, so it does what it always does and calls it. However, it calls it with axis=None (the numpy default). This makes pandas throw an exception (it wants axis to be 0 or 1, never None), so aggregate goes on to the second step, which passes the data in as 1D columns and is foolproof.

However, when you give it np.median, there is no .median method for numpy to delegate to (numpy arrays do not have one), so it runs the normal numpy machinery, which is to flatten the array (i.e. the usual axis=None behaviour). No exception is raised, so the single pooled value is accepted and broadcast to every column.

The workaround would be to use test_g.aggregate([np.median, np.median]) to force it to always take the second path. What would work too is test_g.aggregate(np.median, axis=0), which passes axis=0 on to np.median and thus tells numpy how to handle the data correctly. In general, I wonder if pandas should not at least throw a warning; after all, broadcasting the result to both columns is almost never what is wanted.
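The flattening described above can be reproduced with numpy alone, which is a minimal sketch of the mechanism rather than of what pandas does internally. It takes the values of group C == 1 as a plain 2D array (the names group1, pooled, per_column and by_method are just illustrative) and compares np.median with and without axis=0, then shows the groupby median method mentioned in the question. The aggregate fallback itself may behave differently depending on the pandas version, so the sketch deliberately does not rely on it.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({'A': [10, 11, 12, 13, 15, 25, 43, 70],
                     'B': [1, 2, 3, 4, 5, 6, 7, 8],
                     'C': [1, 1, 1, 1, 2, 2, 2, 2]})

# Columns A and B for group C == 1, as a plain 2D NumPy array.
group1 = test.loc[test['C'] == 1, ['A', 'B']].to_numpy()

# With no axis given, np.median works on the flattened array,
# so the values of both columns are pooled into one number.
pooled = np.median(group1)

# With axis=0 the median is taken down each column separately.
per_column = np.median(group1, axis=0)

# The groupby method computes column-wise medians within each group.
by_method = test.groupby('C').median()

print(pooled)               # 7.0, the value seen in Out[42] for group 1
print(per_column.tolist())  # [11.5, 2.5]
print(by_method)
```

Pooling [10, 11, 12, 13] with [1, 2, 3, 4] gives a median of (4 + 10) / 2 = 7.0, which matches the surprising output in the question. Calling the groupby median method directly avoids the question of which path aggregate takes.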