Given two unordered arrays of the same length, a and b:
a = [7,3,5,7,5,7]
b = [0.2,0.1,0.3,0.1,0.1,0.2]
I’d like to group by the elements in a:
aResult = [7,3,5]
summing the corresponding elements in b (for example, to summarize a probability density function):
bResult = [0.2 + 0.1 + 0.2, 0.1, 0.3 + 0.1] = [0.5, 0.1, 0.4]
Alternatively, random a and b in Python:
import numpy as np
a = np.random.randint(1,10,10000)
b = np.array([1./len(a)]*len(a))
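For reference, the small example above can be checked with a plain-Python sketch. The function name `group_sum_reference` is hypothetical and only illustrates the intended result, including the order of first appearance in aResult; it is not meant to be fast.

```python
def group_sum_reference(keys, values):
    """Sum values per key, keeping keys in order of first appearance."""
    totals = {}  # dicts preserve insertion order in Python 3.7+
    for k, v in zip(keys, values):
        totals[k] = totals.get(k, 0.0) + v
    return list(totals.keys()), list(totals.values())

a_small = [7, 3, 5, 7, 5, 7]
b_small = [0.2, 0.1, 0.3, 0.1, 0.1, 0.2]
a_ref, b_ref = group_sum_reference(a_small, b_small)

print(a_ref)                            # [7, 3, 5]
print([round(x, 10) for x in b_ref])    # [0.5, 0.1, 0.4]
```

The rounding is only there because floating-point sums such as 0.2 + 0.1 + 0.2 are not exactly 0.5.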
I have two approaches, both of which are surely far from the best achievable performance.
Approach 1 (at least nice and short): Time: 0.769315958023
def approach_1(a,b):
    # one boolean mask per unique key, so cost grows with the number of keys
    bResult = [sum(b[i == a]) for i in np.unique(a)]
    aResult = np.unique(a)
    return aResult, bResult
Approach 2 (itertools.groupby, horribly slow) Time: 4.65299129486
from itertools import groupby
def approach_2(a,b):
    tmp = [(a[i],b[i]) for i in range(len(a))]
    tmp2 = np.array(tmp, dtype = [('a', float),('b', float)])
    # groupby only merges adjacent equal keys, so sort by 'a' first
    tmp2 = np.sort(tmp2, order='a')
    bResult = []
    aResult = []
    for key, group in groupby(tmp2, lambda x: x[0]):
        aResult.append(key)
        bResult.append(sum([i[1] for i in group]))
    return aResult, bResult
Update: Approach 3, by Pablo. Time: 1.0265750885
from collections import defaultdict
def approach_Pablo(a,b):
    pdf = defaultdict(int)  # missing keys start at 0
    for x,y in zip(a,b):
        pdf[x] += y
    return pdf
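Update: Approach 4, by Unutbu. Time: 0.184849023819 [WINNER SO FAR, but works only when a contains integers]
def unique_Unutbu(a,b):
    # x[k] holds the sum of b over all positions where a == k
    x = np.bincount(a,weights=b)
    aResult = np.unique(a)
    bResult = x[aResult]
    return aResult, bResult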
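To make the bincount idea concrete, here is a minimal sketch that runs it on the small example from the top of the question. The variable names are chosen for illustration only. It also shows that the intermediate array has one slot for every integer from 0 up to the largest value in a, not just for the values that occur.

```python
import numpy as np

a_small = np.array([7, 3, 5, 7, 5, 7])
b_small = np.array([0.2, 0.1, 0.3, 0.1, 0.1, 0.2])

# sums[k] is the total of b over all positions where a == k
sums = np.bincount(a_small, weights=b_small)
keys = np.unique(a_small)     # sorted distinct values of a
totals = sums[keys]           # keep only the slots for values that occur

print(len(sums))              # 8, i.e. a_small.max() + 1
print(keys)                   # [3 5 7]
print(np.round(totals, 10))   # [0.1 0.4 0.5]
```

Note that the keys come back sorted, so the order differs from the aResult = [7,3,5] shown in the question.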
Maybe someone finds a smarter solution to this problem than me 🙂
If a is composed of non-negative ints < 2**31-1 (that is, if a has values that can fit in dtype int32), then you could use np.bincount with weights, passing a as the values and b as the weights.

np.unique(a) returns [3 5 7], so the result appears in a different order than the aResult = [7,3,5] shown in the question.

One potential problem with using np.bincount is that it returns an array whose length is one greater than the maximum value in a. If a contains even one element with value near 2**31-1, then bincount would have to allocate an array of size 8*(2**31-1) bytes (or 16GiB).

So np.bincount might be the fastest solution for arrays a which have big length, but not big values. For arrays a which have small length (and big or small values), using a collections.defaultdict would probably be faster.

Edit: See J.F. Sebastian’s solution for a way around the integer-values-only restriction and big-values problem.
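One common way around both restrictions is to let np.unique map each value of a to a small consecutive index first, and then run np.bincount on those indices. The sketch below illustrates that general idea; it is not necessarily the exact code from the answer referred to above, and the function name `group_sum_any_dtype` is made up for this example.

```python
import numpy as np

def group_sum_any_dtype(a, b):
    """Group-sum b by the values in a, for any sortable dtype of a."""
    # inverse[i] is the position of a[i] within the sorted unique keys,
    # so it only takes values in 0 .. len(keys) - 1
    keys, inverse = np.unique(a, return_inverse=True)
    sums = np.bincount(inverse, weights=b)
    return keys, sums

# Float keys: np.bincount alone would reject these
a_float = np.array([7.5, 3.25, 5.0, 7.5, 5.0, 7.5])
b_float = np.array([0.2, 0.1, 0.3, 0.1, 0.1, 0.2])
keys_f, sums_f = group_sum_any_dtype(a_float, b_float)

# Huge integer keys: the bincount array has only 2 slots, not 2**40 + 1
a_big = np.array([2**40, 3, 2**40])
b_big = np.array([1.0, 2.0, 3.0])
keys_b, sums_b = group_sum_any_dtype(a_big, b_big)
```

The trade-off is that np.unique sorts a, so this is likely somewhat slower than calling np.bincount directly on small non-negative integers.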