Is there any fast way to obtain unique elements in numpy? I have code similar to this (the last line)
tab = numpy.arange(100000000)
indices1 = numpy.random.permutation(10000)
indices2 = indices1.copy()
indices3 = indices1.copy()
indices4 = indices1.copy()
result = numpy.unique(numpy.array([tab[indices1], tab[indices2], tab[indices3], tab[indices4]]))
This is just an example; in my situation indices1, indices2, ..., indices4 contain different sets of indices and have various sizes. The last line is executed many times and I noticed that it’s actually the bottleneck in my code ({numpy.core.multiarray.arange} to be precise). Besides, ordering is not important and the elements of the index arrays are of int32 type. I was thinking about using a hashtable with the element value as key and tried:
seq = itertools.chain(tab[indices1].flatten(), tab[indices2].flatten(), tab[indices3].flatten(), tab[indices4].flatten())
myset = {}
map(myset.__setitem__, seq, [])
result = numpy.array(myset.keys())
but it was even worse.
Is there any way to speed this up? I guess the performance penalty comes from ‘fancy indexing’, which copies the array, but I only need to read the resulting elements (I don’t modify anything).
Sorry, I don’t completely understand your question, but I’ll do my best to help.
First, {numpy.core.multiarray.arange} is numpy.arange, not fancy indexing. Unfortunately, fancy indexing does not show up as a separate line item in the profiler. If you’re calling np.arange in the loop, you should see if you can move it outside.
Second, I assume that tab is not np.arange(a, b) in your code, because if it is, then tab[index] == index + a. I assume that was just for your example.
Third, np.concatenate is about 10 times faster than np.array.
(Also, np.concatenate gives a (4*n,) array and np.array gives a (4, n) array, where n is the length of each of indices1-4. The latter will only work if indices1-4 are all the same length.)
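To make the shape difference concrete, here is a small sketch. The tiny arrays and the names flat and stacked are made up for illustration, and nothing is timed here, so the speed difference is something to measure on your own data.

```python
import numpy as np

# Small stand-in data (hypothetical sizes, chosen so the output is easy to read).
tab = np.arange(100, 200)
indices1 = np.array([3, 1, 2], dtype=np.int32)
indices2 = np.array([2, 5], dtype=np.int32)  # different length on purpose

# np.concatenate joins the pieces end to end, so unequal lengths are fine.
flat = np.concatenate((tab[indices1], tab[indices2]))
result = np.unique(flat)

# np.array stacks equal-length pieces into a 2-D array instead.
stacked = np.array([tab[indices1], tab[indices1]])

print(flat.shape)     # (5,)
print(stacked.shape)  # (2, 3)
print(result)         # [101 102 103 105]
```

np.unique flattens its input by default, so both forms give the same unique values when the pieces happen to have equal length.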
And last, you could save even more time by taking the unique indices first and only then looking them up in tab.
Doing it in this order is faster because you reduce the number of indices you need to look up in tab, but it’ll only work if you know that the elements of tab are unique (otherwise you could get repeats in result even if the indices are unique).
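A minimal sketch of that ordering, assuming the elements of tab really are unique. The random data, the sizes, and the names all_indices, unique_indices and reference are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: tab holds unique values (here a permutation of 0..999).
tab = rng.permutation(1000)

# Four index arrays of different sizes, with overlaps between them.
indices1 = rng.integers(0, 1000, size=50)
indices2 = rng.integers(0, 1000, size=80)
indices3 = rng.integers(0, 1000, size=30)
indices4 = rng.integers(0, 1000, size=120)

# Deduplicate the indices first, then do a single lookup into tab.
all_indices = np.concatenate((indices1, indices2, indices3, indices4))
unique_indices = np.unique(all_indices)
result = tab[unique_indices]

# Cross-check against the original order of operations (lookup, then unique).
reference = np.unique(np.concatenate(
    (tab[indices1], tab[indices2], tab[indices3], tab[indices4])))
assert np.array_equal(np.sort(result), reference)
```

Note that result comes back ordered by index rather than by value, which should be fine since ordering does not matter in the question.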
Hope that helps