The Archive Base

Editorial Team
Asked: June 10, 2026

In a large user database with the following format and sample data, we are trying to identify duplicated people:

id   first_name    last_name   email
---------------------------------------------------
 1   chris         baker       
 2   chris         baker       chris@gmail.com
 3   chris         baker       chris@hotmail.com
 4   chris         baker       crayzyguy@crazy.com  
 5   carl          castle      castle@npr.org
 6   mike          rotch       fakeuser@sample.com  

I am using the following query:

SELECT 
    GROUP_CONCAT(id) AS "ids",
    CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
    COUNT(*) AS "duplicate_count" 
FROM 
    users 
GROUP BY 
    name 
HAVING 
    duplicate_count > 1

This works great; I get a list of duplicates with the id numbers of the involved rows.
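
For anyone who wants to reproduce the report, here is a minimal sketch that loads the sample rows and runs the same grouping. It assumes an in-memory SQLite database as a stand-in for MySQL, so `||` replaces `CONCAT()`, and it stores the blank email of user 1 as NULL; both are assumptions, not part of the original setup.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, email TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [
        (1, "chris", "baker", None),  # blank email in the sample, stored as NULL here
        (2, "chris", "baker", "chris@gmail.com"),
        (3, "chris", "baker", "chris@hotmail.com"),
        (4, "chris", "baker", "crayzyguy@crazy.com"),
        (5, "carl", "castle", "castle@npr.org"),
        (6, "mike", "rotch", "fakeuser@sample.com"),
    ],
)

# Same grouping as the query above; || replaces MySQL's CONCAT().
report = conn.execute(
    """
    SELECT GROUP_CONCAT(id) AS ids,
           UPPER(first_name) || UPPER(last_name) AS name,
           COUNT(*) AS duplicate_count
    FROM users
    GROUP BY name
    HAVING duplicate_count > 1
    """
).fetchall()

# Only the four Chris Baker rows form a group larger than one.
for ids, name, duplicate_count in report:
    print(name, sorted(int(i) for i in ids.split(",")), duplicate_count)
```

On the sample data only the Chris Baker group is reported; Carl Castle and Mike Rotch each appear once and are filtered out by the HAVING clause.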

We would re-assign any associated data tied to a duplicate to the actual person (set user_id = 2 where user_id = 3), then we delete the duplicating user row.
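
The merge step could be sketched roughly as below. The `orders` table and the `merge_user` helper are hypothetical (the question does not show which tables reference `user_id`), and SQLite is again only a stand-in for MySQL. The point is that the re-assignment and the delete run in one transaction, so a failure cannot leave orphaned rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    -- Hypothetical child table; the real schema is not shown in the question.
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, item TEXT);
    INSERT INTO users VALUES (2, 'chris', 'baker'), (3, 'chris', 'baker');
    INSERT INTO orders VALUES (10, 2, 'book'), (11, 3, 'lamp');
    """
)

def merge_user(conn, keep_id, duplicate_id):
    """Move child rows to keep_id, then delete duplicate_id, in one transaction."""
    with conn:  # commits on success, rolls back if either statement raises
        conn.execute(
            "UPDATE orders SET user_id = ? WHERE user_id = ?", (keep_id, duplicate_id)
        )
        conn.execute("DELETE FROM users WHERE id = ?", (duplicate_id,))

merge_user(conn, keep_id=2, duplicate_id=3)
remaining_users = [row[0] for row in conn.execute("SELECT id FROM users ORDER BY id")]
order_owners = [row[0] for row in conn.execute("SELECT user_id FROM orders ORDER BY id")]
print(remaining_users, order_owners)
```

With more than one referencing table, each would need its own UPDATE inside the same transaction.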

The trouble comes after we run this report the first time. We clean up the list after manually verifying that the rows are indeed duplicates, but some ARE NOT duplicates: there are 2 Chris Bakers who are legitimate, distinct users.

We don’t want to keep seeing Chris Baker in subsequent duplicate reports until the end of time, so I am looking for a way to flag that user id 1 and user id 4 are NOT duplicates of each other for future reports, but they could be duplicated by new users added later.

What I tried

I added an is_not_duplicate field to the user table and excluded flagged rows from the report. But if a new duplicate “Chris Baker” then gets added to the database, it does not show up on the duplicate report, because is_not_duplicate improperly excludes the existing account it should be compared against. My HAVING clause would not meet the > 1 threshold until there are two new duplicates of Chris Baker, in addition to the “real” one marked is_not_duplicate.
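
To make the failure concrete, here is a small sketch of that attempt. The question does not show the modified query, so the `WHERE is_not_duplicate = 0` filter below is an assumed form of it; user 7 is a hypothetical new sign-up, and SQLite stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT,"
    " is_not_duplicate INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [
        (1, "chris", "baker", 1),  # verified as a distinct person
        (4, "chris", "baker", 1),  # verified as a distinct person
        (7, "chris", "baker", 0),  # hypothetical new sign-up, not yet reviewed
    ],
)

# Assumed form of the attempt: drop flagged rows before grouping.
filtered_report = conn.execute(
    """
    SELECT GROUP_CONCAT(id) AS ids,
           UPPER(first_name) || UPPER(last_name) AS name,
           COUNT(*) AS duplicate_count
    FROM users
    WHERE is_not_duplicate = 0
    GROUP BY name
    HAVING duplicate_count > 1
    """
).fetchall()

# User 7 is alone in the group once flagged rows are removed, so nothing is reported.
print(filtered_report)
```

Because the flagged rows are removed before grouping, the new row has nothing left to be compared with and the report comes back empty.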

Question Summed Up

How can I build exceptions into the above query without looping results or multiple queries?

Sub-queries are fine, but the size of the dataset makes every query count and I’d like the solution to be as performant as possible.

1 Answer

  1. Editorial Team
    Added an answer on June 10, 2026 at 2:05 pm

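    Keep the is_not_duplicate boolean field (ideally declared NOT NULL DEFAULT 0), but rather than excluding flagged rows, count them within each group and compare that count against the group size: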

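    SELECT 
        GROUP_CONCAT(id) AS "ids",
        CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
        COUNT(*) AS "duplicate_count",
        -- rows already verified as distinct people; COALESCE treats a NULL flag as 0
        SUM(COALESCE(is_not_duplicate, 0)) AS "real_count"
    FROM 
        users 
    GROUP BY 
        name 
    HAVING 
        duplicate_count > 1
    AND
        -- keep the group only while at least one row is still unverified
        duplicate_count - real_count > 0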
    

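    Newly added duplicates will have is_not_duplicate = 0, so real_count for that name will be less than duplicate_count and the group will be shown again. A name whose rows have all been verified has real_count equal to duplicate_count and stays hidden.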
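
A quick way to check this behaviour is the sketch below. It assumes SQLite in place of MySQL (`||` instead of `CONCAT()`), a flag column declared `NOT NULL DEFAULT 0`, and a hypothetical new user 7; none of these come from the original post.

```python
import sqlite3

# The answer's query, with || standing in for MySQL's CONCAT().
REPORT_SQL = """
    SELECT GROUP_CONCAT(id) AS ids,
           UPPER(first_name) || UPPER(last_name) AS name,
           COUNT(*) AS duplicate_count,
           SUM(is_not_duplicate) AS real_count
    FROM users
    GROUP BY name
    HAVING duplicate_count > 1
       AND duplicate_count - real_count > 0
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT,"
    " is_not_duplicate INTEGER NOT NULL DEFAULT 0)"
)
# Users 1 and 4 were reviewed and flagged as distinct people.
conn.executemany(
    "INSERT INTO users (id, first_name, last_name, is_not_duplicate) VALUES (?, ?, ?, ?)",
    [(1, "chris", "baker", 1), (4, "chris", "baker", 1)],
)

# Every row in the group is flagged, so the group is suppressed.
before = conn.execute(REPORT_SQL).fetchall()

# Hypothetical new sign-up; the flag falls back to its default of 0.
conn.execute("INSERT INTO users (id, first_name, last_name) VALUES (7, 'chris', 'baker')")

# One unverified row is enough to bring the whole group back.
after = conn.execute(REPORT_SQL).fetchall()

print(before)
print([(name, dup, real) for _, name, dup, real in after])
```

The whole group comes back, including the two verified rows, so the reviewer can see who the new row might duplicate. After review, either merge user 7 into an existing account or flag it too, and the group drops off the report again.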
