I have a dataset that may contain duplicates. To check for duplicates, I put the indices into a numpy structured array, sort the array, create another array from the unique values, and then compare the lengths of the two arrays:
import numpy as np

# Copy the (date, symbol) index pairs into a structured array
data = np.zeros(t_len, dtype={'names': ['date', 'symbol'], 'formats': ['i8', 'S16']})
data[:] = [(x['date'], x['symbol']) for x in tbl.iterrows()]
data.sort(order=['date', 'symbol'])
# If np.unique drops any rows, the original contained duplicates
data2 = np.unique(data)
duplicates = False
if len(data) != len(data2):
    duplicates = True
    print("There are duplicates")
if not duplicates:
    print("No duplicates found")
Now, what I would really like to do is determine the indices that contain the duplicates. For example, if I had a dataset that contained:
array([(12322323, 'IBM'), (12322323, 'IBM'), (12322323, 'MSFT'), (12322323, 'IBM')])
I would like to see an array with array([(12322323, 'IBM')])
I’ve looked into using unique and difference functions, but those don’t seem to do the job.
For simplicity, I’ll just use an array of integers, x, as the input.

With numpy version 1.9.0 or later, we can use np.unique to get the unique elements, with the argument return_counts=True so that the number of occurrences of each unique element is also returned.

For older versions of numpy, one can use np.unique with the argument return_inverse=True to also get the array that shows how to recreate x from the array of unique elements. Now use bincount on that inverse array to count the number of occurrences of each element.

So now we have counts, which tells us how many times each element occurs in the array. We can pull out the elements that have duplicates by keeping the unique elements whose count is greater than 1.
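The steps described above can be sketched roughly as follows. The sample values in x and the names u, c, inv, dup, dup_old and dup_idx are my own choices for illustration, not taken from the original post. The last line, which maps the counts back onto x to get the positions of the duplicated entries, is an extra step I am assuming is useful since the question asks about indices.

```python
import numpy as np

# Hypothetical sample input (illustrative values only).
x = np.array([1, 1, 1, 2, 2, 3, 5, 25, 1, 1])

# numpy >= 1.9.0: unique values and their counts in a single call.
u, c = np.unique(x, return_counts=True)
dup = u[c > 1]  # values that occur more than once

# Older numpy: get the inverse mapping, then count with bincount.
u_old, inv = np.unique(x, return_inverse=True)
counts = np.bincount(inv)  # counts[i] is the number of occurrences of u_old[i]
dup_old = u_old[counts > 1]

# Assumed extra step: positions in x whose value occurs more than once.
dup_idx = np.flatnonzero(counts[inv] > 1)
```

With this sample input, dup and dup_old both contain 1 and 2, and dup_idx lists every position in x that holds one of those values.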