In a large user database with the following format and sample data, we are

Question

0

Asked: June 10, 20262026-06-10T14:05:35+00:00 2026-06-10T14:05:35+00:00

In a large user database with the following format and sample data, we are

0

In a large user database with the following format and sample data, we are trying to identify duplicated people:

id   first_name    last_name   email
---------------------------------------------------
 1   chris         baker       
 2   chris         baker       chris@gmail.com
 3   chris         baker       chris@hotmail.com
 4   chris         baker       crayzyguy@crazy.com  
 5   carl          castle      castle@npr.org
 6   mike          rotch       fakeuser@sample.com

I am using the following query:

SELECT 
    GROUP_CONCAT(id) AS "ids",
    CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
    COUNT(*) AS "duplicate_count" 
FROM 
    users 
GROUP BY 
    name 
HAVING 
    duplicate_count > 1

This works great; I get a list of duplicates with the id numbers of the involved rows.

We would re-assign any associated data tied to a duplicate to the actual person (set user_id = 2 where user_id = 3), then we delete the duplicating user row.

The trouble comes after we make this report the first time, as we clean up the list after manually verifying that they are indeed duplicates — some ARE NOT duplicates. There are 2 Chris Bakers that are legitimate users.

We don’t want to keep seeing Chris Baker in subsequent duplicate reports until the end of time, so I am looking for a way to flag that user id 1 and user id 4 are NOT duplicates of each other for future reports, but they could be duplicated by new users added later.

What I tried

I added a is_not_duplicate field to the user table, but then if a new duplicate “Chris Baker” gets added to the database, it will cause this situation to not show on the duplicate report; the is_not_duplicate improperly excludes one of the accounts. My HAVING statement would not meet the > 1 threshold until there are -two- duplicates of Chris Baker, plus the “real” one marked is_not_duplicate.

Question Summed Up

How can I build exceptions into the above query without looping results or multiple queries?

Sub-queries are fine, but the size of the dataset makes every query count and I’d like the solution to be as performant as possible.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-10T14:05:36+00:00

Try to add the is_not_duplicate boolean field and modify your code as follows:

SELECT 
    GROUP_CONCAT(id) AS "ids",
    CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
    COUNT(*) AS "duplicate_count",
    SUM(is_not_duplicate) AS "real_count"
FROM 
    users 
GROUP BY 
    name 
HAVING 
    duplicate_count > 1
AND
    duplicate_count - real_count > 0

Newly added duplicates will have is_not_duplicate=0 so the real_count for that name will be less than duplicate_count and the row will be shown

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

In a large user database with the following format and sample data, we are

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply