I have a data set that may contain duplicates. In order to find the

Question

Asked: 2026-06-11T21:23:37+00:00

I have a data set that may contain duplicates. To check for duplicates, I put the indices into a NumPy structured array, sort the array, create a second array from its unique values, and then compare the lengths of the two arrays:

import numpy as np

# tbl and t_len are defined elsewhere in the program
data = np.zeros(t_len, dtype={'names': ['date', 'symbol'], 'formats': ['i8', 'S16']})
data[:] = [(x['date'], x['symbol']) for x in tbl.iterrows()]
data.sort(order=['date', 'symbol'])
data2 = np.unique(data)

# np.unique drops repeated records, so a length mismatch means duplicates exist
duplicates = len(data) != len(data2)

if duplicates:
    print("There are duplicates")
else:
    print("No duplicates found")

Now, what I would really like to do is determine the indices that contain the duplicates. For example, if I had a dataset that contained:

array([(12322323, 'IBM'), (12322323, 'IBM'), (12322323, 'MSFT'), (12322323, 'IBM')])

I would like to get back an array containing the duplicated entry: array([(12322323, 'IBM')])

I’ve looked into using unique and difference functions, but those don’t seem to do the job.

1 Answer

Editorial Team · Answer 1 · 2026-06-11T21:23:39+00:00

For simplicity, I’ll just use an array of integers, x, as the input:

>>> x = np.array([20, 10, 30, 10, 60, 30, 10])

With NumPy 1.9.0 or later, we can use np.unique with the argument return_counts=True, so that the number of occurrences of each unique element is returned along with the unique elements:

>>> u, counts = np.unique(x, return_counts=True)
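
As a script rather than an interactive session, the same call can be sketched as follows. It reuses the sample x from this answer and assumes NumPy 1.9.0 or later is installed.

```python
import numpy as np

x = np.array([20, 10, 30, 10, 60, 30, 10])

# u holds the sorted unique values; counts[i] is how often u[i] occurs in x
u, counts = np.unique(x, return_counts=True)

print(u)       # [10 20 30 60]
print(counts)  # [3 1 2 1]
```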

For older versions of numpy, one can use np.unique with the argument return_inverse=True to also get the array that shows how to recreate x from the array of unique elements:

>>> u, inv = np.unique(x, return_inverse=True)
>>> u
array([10, 20, 30, 60])
>>> inv
array([1, 0, 2, 0, 3, 2, 0])
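
To make the meaning of the inverse array concrete, here is a small check (a sketch using the same sample x) that indexing u with inv gives back the original array. The name rebuilt is only for illustration.

```python
import numpy as np

x = np.array([20, 10, 30, 10, 60, 30, 10])
u, inv = np.unique(x, return_inverse=True)

# inv[i] is the position in u of the value x[i], so u[inv] rebuilds x
rebuilt = u[inv]

print(rebuilt)                     # [20 10 30 10 60 30 10]
print(np.array_equal(rebuilt, x))  # True
```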

Now use bincount to count the number of occurrences of each element:

>>> counts = np.bincount(inv)
>>> counts
array([3, 1, 2, 1])
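
As a sanity check, the two routes to counts should agree. The sketch below compares them on the sample x; counts_old and counts_new are illustrative names, and the second call needs NumPy 1.9.0 or later.

```python
import numpy as np

x = np.array([20, 10, 30, 10, 60, 30, 10])

# Older route: count how often each position in u appears in the inverse array
u, inv = np.unique(x, return_inverse=True)
counts_old = np.bincount(inv)

# Newer route: let np.unique return the counts directly
_, counts_new = np.unique(x, return_counts=True)

print(np.array_equal(counts_old, counts_new))  # True
```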

So now we have counts, which tells us how many times each element occurs in the array. We can pull out the elements that have duplicates as follows:

>>> dups = u[counts > 1]
>>> dups
array([10, 30])
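
The same idea should carry over to the structured array from the question, and the inverse array can also give the positions of the duplicated records. The following is a sketch, not a tested solution for the asker's table: the rows are the question's example written as byte strings to match the 'S16' field, and dup_idx is an illustrative name.

```python
import numpy as np

# The question's example rows; symbols are bytes to match the 'S16' field
rows = [(12322323, b'IBM'), (12322323, b'IBM'),
        (12322323, b'MSFT'), (12322323, b'IBM')]
data = np.array(rows, dtype=[('date', 'i8'), ('symbol', 'S16')])

u, inv = np.unique(data, return_inverse=True)
counts = np.bincount(inv)

# Records that occur more than once
dups = u[counts > 1]

# Positions in data whose record occurs more than once
dup_idx = np.flatnonzero(counts[inv] > 1)

print(dups)
print(dup_idx)  # [0 1 3]
```

Here counts[inv] maps each row of data to the number of times its record occurs, so comparing it with 1 marks every row that belongs to a duplicated group.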
