How would I go about applying an aggregating function (such as "sum()" or "max()") to bins in a vector?
That is, if I have:
- a vector of values x of length N
- a vector of bin tags b of length N
such that b indicates to what bin each value in x belongs.
For every possible value in b, I want to apply the aggregating function "func()" to all the values of x that belong to that bin. For example, given:
>> x = [1,2,3,4,5,6]
>> b = ["a","b","a","a","c","c"]
the output should be 2 vectors (say the aggregating function is the product function):
>> (labels, y) = apply_to_bins(values = x, bins = b, func = prod)
labels = ["a","b","c"]
y = [12, 2, 30]
I want to do this as elegantly as possible in numpy (or just Python), since obviously I could just write a "for" loop over it.
With pandas, this can be done with groupby. Using the example of the OP:
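A minimal sketch of what such a groupby call could look like. The name with_pandas_groupby matches the function mentioned in this answer, but its signature and body here are assumptions, not the original code:

```python
import numpy as np
import pandas as pd

def with_pandas_groupby(func, x, b):
    """Apply func to the values of x, grouped by the bin tags in b.

    Hypothetical signature: the original definition is not shown.
    """
    # Group the values by their tags; group keys come back sorted.
    result = pd.Series(x).groupby(np.asarray(b)).agg(func)
    return result.index.to_numpy(), result.to_numpy()

x = [1, 2, 3, 4, 5, 6]
b = ["a", "b", "a", "a", "c", "c"]
labels, y = with_pandas_groupby(np.prod, x, b)
print(labels.tolist())  # ['a', 'b', 'c']
print(y.tolist())       # [12, 2, 30]
```

Passing a string such as "sum" or "max" as func should also work, since pandas accepts aggregation names in agg.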
I was just interested in the speed, so I compared with_pandas_groupby with some functions given in senderle's answer. The functions compared were:
- apply_to_bins_groupby
- apply_to_bins3
- with_pandas_groupby
So pandas is the fastest for large item sizes. Furthermore, the number of levels (bins) has no big influence on computation time. (Note that the time is calculated starting from numpy arrays, and the time to create the pandas.Series is included.)
I generated the data with:
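The original generation code is not shown here. One plausible way to produce comparable test data is sketched below; make_data, its parameters, and the seed are hypothetical names chosen for illustration:

```python
import numpy as np

def make_data(n_items, n_levels, seed=0):
    """Return n_items random values and an integer bin tag for each.

    Hypothetical helper: not the data generator used for the timings above.
    """
    rng = np.random.default_rng(seed)
    x = rng.random(n_items)                      # values in [0, 1)
    b = rng.integers(0, n_levels, size=n_items)  # tags in 0 .. n_levels - 1
    return x, b

x, b = make_data(n_items=100_000, n_levels=100)
```

Varying n_items and n_levels independently lets you check how each one affects the run time.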
And then ran the benchmark in IPython like this:
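The exact benchmark commands are not preserved. As a rough stand-in, the sketch below uses the standard-library timeit module, which measures roughly the same thing as IPython's %timeit magic. The function body, data sizes, and repeat counts are assumptions:

```python
import timeit
import numpy as np
import pandas as pd

def with_pandas_groupby(func, x, b):
    # Hypothetical implementation; Series creation is included in the timing.
    result = pd.Series(x).groupby(b).agg(func)
    return result.index.to_numpy(), result.to_numpy()

rng = np.random.default_rng(0)
x = rng.random(100_000)
b = rng.integers(0, 100, size=x.size)

# In IPython this would be roughly: %timeit with_pandas_groupby("max", x, b)
n_calls = 5
best = min(timeit.repeat(lambda: with_pandas_groupby("max", x, b),
                         number=n_calls, repeat=3))
elapsed = best / n_calls
print(f"{elapsed * 1e3:.2f} ms per call")
```

Taking the minimum over several repeats reduces noise from other processes, which is also what the timeit documentation recommends.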