I have a 100000000×2 array named “a”, with an index in the first column and a related value in the second column. I need to get the median of the values in the second column for each index. This is how I could do it with a for statement:
import numpy as np
b = np.zeros(1000000)
a = np.array([[1, 2],
              [1, 3],
              [2, 3],
              [2, 4],
              [2, 6],
              [1, 4],
              ...
              ...
              [1000000,6]])
for i in range(1000000):
    # the indices in column 0 run from 1 to 1000000, so slot i holds index i + 1
    b[i] = np.median(a[a[:, 0] == i + 1, 1])
Obviously this is too slow with the for loop. Any suggestions? Thanks
A quick 1-line approach:
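The snippet that went with this line is not shown here, so what follows is only a sketch of what such a one-liner could look like, not necessarily the original code. It assumes the layout from the question (index in column 0, value in column 1) and uses a tiny stand-in array; the name `medians` is made up for illustration.

```python
import numpy as np

# Small stand-in for the real array: column 0 is the index, column 1 the value.
a = np.array([[1, 2], [1, 3], [2, 3], [2, 4], [2, 6], [1, 4]])

# One median per distinct index, in ascending index order.
# np.unique returns the sorted distinct indices; the boolean mask picks each group's rows.
medians = [np.median(a[a[:, 0] == i, 1]) for i in np.unique(a[:, 0])]
```

This still scans the whole array once per distinct index, so it mainly tidies up the loop rather than changing how much work is done.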
I’m not convinced there’s much you can do to make that go faster without sacrificing accuracy. But here’s another attempt, which might be faster if you can skip the sort step:
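The code for this second attempt is also missing here. One plausible reading of "skip the sort step" is a sort-then-split approach, sketched below under that assumption: sort the rows by index so each group is contiguous, find where the index changes, and split the value column there. The names `a_sorted`, `boundaries`, `group_ids` and `group_medians` are illustrative only.

```python
import numpy as np

# Small stand-in for the real array: column 0 is the index, column 1 the value.
a = np.array([[1, 2], [1, 3], [2, 3], [2, 4], [2, 6], [1, 4]])

# Sort rows by index so each group is contiguous.
# If the data already arrives sorted by index, this step can be skipped.
a_sorted = a[a[:, 0].argsort(kind="stable")]

# Positions where the index changes mark the start of a new group.
boundaries = np.flatnonzero(np.diff(a_sorted[:, 0])) + 1

# Split the value column at those boundaries and take each group's median.
group_medians = [np.median(g) for g in np.split(a_sorted[:, 1], boundaries)]

# The index each median belongs to: the first row of every group.
group_ids = a_sorted[np.r_[0, boundaries], 0]
```

Here the array is sorted once and each group is then a plain slice, instead of building a full-length boolean mask for every index.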
The latter is very slightly faster for small arrays. Not sure if it’s fast enough.