How would I go about applying an aggregating function (such as sum() or max()

Editorial Team

Asked: June 9, 2026

How would I go about applying an aggregating function (such as "sum()" or "max()") to bins in a vector?

That is, if I have:

a vector of values x of length N
a vector of bin tags b of length N

such that b indicates which bin each value in x belongs to.
For every possible value in b, I want to apply the aggregating function "func()" to all the values of x that belong to that bin.

>> x = [1,2,3,4,5,6]
>> b = ["a","b","a","a","c","c"]

The output should be two vectors (say the aggregating function is the product function):

>>(labels, y) = apply_to_bins(values = x, bins = b, func = prod)

labels = ["a","b","c"]
y = [12, 2, 30]

I want to do this as elegantly as possible in numpy (or just Python), since obviously I could just write a "for" loop over it.


1 Answer

Editorial Team · Answer 1 · 2026-06-09T03:04:58+00:00

With pandas groupby, this would be:

import pandas as pd

def with_pandas_groupby(func, x, b):
    grouped = pd.Series(x).groupby(b)
    return grouped.agg(func)

Using the example of the OP:

>>> import numpy as np
>>> x = [1,2,3,4,5,6]
>>> b = ["a","b","a","a","c","c"]
>>> with_pandas_groupby(np.prod, x, b)
a    12
b     2
c    30
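
The question asks for two vectors (labels, y) rather than a pandas Series. A minimal sketch of such a wrapper follows, assuming the grouped result can simply be unpacked into its index and values. The name apply_to_bins and its keyword arguments are taken from the question; the body is an illustration, not code from the original answer.

```python
import numpy as np
import pandas as pd

def apply_to_bins(values, bins, func):
    """Return (labels, y): the sorted bin labels and func applied to each bin."""
    # Converting the tags to an array makes pandas treat them as one grouping key
    # per value rather than as a list of names to look up.
    result = pd.Series(values).groupby(np.asarray(bins)).agg(func)
    # groupby sorts the group keys by default, so labels come back in sorted order.
    return result.index.tolist(), result.tolist()

labels, y = apply_to_bins(values=[1, 2, 3, 4, 5, 6],
                          bins=["a", "b", "a", "a", "c", "c"],
                          func=np.prod)
print(labels, y)
```

Depending on the pandas version, passing np.prod may emit a warning suggesting the string 'prod' instead; the grouped products are the same either way.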

I was just interested in the speed, so I compared with_pandas_groupby with some of the functions given in senderle's answer.

apply_to_bins_groupby

 3 levels,      100 values: 175 us per loop
 3 levels,     1000 values: 1.16 ms per loop
 3 levels,  1000000 values: 1.21 s per loop

10 levels,      100 values: 304 us per loop
10 levels,     1000 values: 1.32 ms per loop
10 levels,  1000000 values: 1.23 s per loop

26 levels,      100 values: 554 us per loop
26 levels,     1000 values: 1.59 ms per loop
26 levels,  1000000 values: 1.27 s per loop

apply_to_bins3

 3 levels,      100 values: 136 us per loop
 3 levels,     1000 values: 259 us per loop
 3 levels,  1000000 values: 205 ms per loop

10 levels,      100 values: 297 us per loop
10 levels,     1000 values: 447 us per loop
10 levels,  1000000 values: 262 ms per loop

26 levels,      100 values: 617 us per loop
26 levels,     1000 values: 795 us per loop
26 levels,  1000000 values: 299 ms per loop

with_pandas_groupby

 3 levels,      100 values: 365 us per loop
 3 levels,     1000 values: 443 us per loop
 3 levels,  1000000 values: 89.4 ms per loop

10 levels,      100 values: 369 us per loop
10 levels,     1000 values: 453 us per loop
10 levels,  1000000 values: 88.8 ms per loop

26 levels,      100 values: 382 us per loop
26 levels,     1000 values: 466 us per loop
26 levels,  1000000 values: 89.9 ms per loop

So pandas is the fastest for large inputs (1,000,000 values), while apply_to_bins3 is faster for small inputs with few levels. Furthermore, for with_pandas_groupby the number of levels (bins) has little influence on computation time.
(Note that the timings start from numpy arrays, so the time to create the pandas.Series is included.)
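
For readers who want to avoid the pandas dependency, the same grouping can be sketched in plain numpy by sorting the values by bin and splitting at the bin boundaries. This is an illustrative variant, not one of the functions benchmarked above, and its timings were not measured here; the name apply_to_bins_numpy is hypothetical.

```python
import numpy as np

def apply_to_bins_numpy(values, bins, func):
    """Return (labels, y) using only numpy; func is called once per bin."""
    values = np.asarray(values)
    # labels: sorted unique tags; inverse: for each value, the index of its label.
    labels, inverse = np.unique(bins, return_inverse=True)
    # A stable sort keeps the original order of the values inside each bin.
    order = np.argsort(inverse, kind='stable')
    # The number of items per bin gives the split points between consecutive bins.
    counts = np.bincount(inverse, minlength=labels.size)
    groups = np.split(values[order], np.cumsum(counts)[:-1])
    return labels, np.array([func(g) for g in groups])

labels, y = apply_to_bins_numpy([1, 2, 3, 4, 5, 6],
                                ["a", "b", "a", "a", "c", "c"],
                                np.prod)
print(labels, y)
```

The Python-level loop here runs over bins, not over individual values, so its cost grows with the number of levels.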

I generated the data with:

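def gen_data(nlevels, size):
    size = int(size)  # allow sizes given as floats such as 10e5
    choices = 'abcdefghijklmnopqrstuvwxyz'
    # Use the first nlevels letters as the bin labels
    levels = np.asarray(list(choices[:nlevels]))
    # randint excludes the upper bound, so indices run from 0 to levels.size - 1
    index = np.random.randint(0, levels.size, size)
    b = levels[index]
    x = np.arange(1, size + 1)
    return x, b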

And then run the benchmark in ipython like this:

In [174]: for nlevels in (3, 10, 26):
   .....:     for size in (100, 1000, 10**6):
   .....:         x, b = gen_data(nlevels, size)
   .....:         print('%2d levels, ' % nlevels, '%7d values:' % size, end=' ')
   .....:         %timeit function_to_time(np.prod, x, b)
   .....:     print()
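
Outside of IPython the %timeit magic is not available. A rough equivalent using the standard timeit module might look like the sketch below. The helper best_time and its repeat/number defaults are assumptions made for illustration, the 1,000,000-value case is left out to keep the run short, and the absolute numbers will differ from those reported above.

```python
import timeit
import numpy as np
import pandas as pd

def with_pandas_groupby(func, x, b):
    return pd.Series(x).groupby(b).agg(func)

def gen_data(nlevels, size):
    # First nlevels letters serve as bin labels, drawn uniformly at random
    levels = np.array(list('abcdefghijklmnopqrstuvwxyz'[:nlevels]))
    index = np.random.randint(0, levels.size, size)
    return np.arange(1, size + 1), levels[index]

def best_time(func, x, b, repeat=3, number=10):
    """Best average time in seconds for one call of func(np.prod, x, b)."""
    timer = timeit.Timer(lambda: func(np.prod, x, b))
    return min(timer.repeat(repeat=repeat, number=number)) / number

for nlevels in (3, 10, 26):
    for size in (100, 1000):
        x, b = gen_data(nlevels, size)
        seconds = best_time(with_pandas_groupby, x, b)
        print('%2d levels, %7d values: %.0f us per loop'
              % (nlevels, size, seconds * 1e6))
    print()
```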
