I’m trying to find duplicate customers in a table that looks like this:
customer_id | first_name | last_name
-------------------------------------
0 | Rich | Smith
1 | Paul | Jones
2 | Richard | Smith
3 | Jimmy | Roberts
In this situation, I need a query that returns customer_id 0 and customer_id 2. The query needs to find matches where a customer may have shortened their name, for example Rich instead of Richard, or Rob instead of Robert.
I have this query, but it only returns ONE of the matches, not both. I need the query to return both Rich and Richard.
select distinct customers.customer_id, concat(customers.first_name,' ',customers.last_name) as name from customers
inner join customers dup on customers.last_name = dup.last_name
where (dup.first_name like concat('%', customers.first_name, '%')
and dup.customer_id <> customers.customer_id )
order by name
Can someone please point me in the right direction?
Per @tsOverflow, this is the final query that solved my problem:
select distinct customers.customer_id, concat(customers.first_name,' ',customers.last_name) as name
from customers
inner join customers dup on customers.last_name = dup.last_name
where (dup.first_name like concat('%', customers.first_name, '%')
    or customers.first_name like concat('%', dup.first_name, '%'))
  and dup.customer_id <> customers.customer_id
order by name
The above solution may have performance issues.
The problem is that Rich is a substring of Richard, but not the other way around, so your LIKE condition only matches when the shorter name is on the customers side of the join.
The fix is to check both ways: I added an OR that does the LIKE check the other way around.
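To see the difference on the sample data from the question, here is a minimal sketch in MySQL syntax. The table name customers_demo and the aliases c and dup are assumptions, chosen so the script does not touch a real customers table; the column names follow the question.

```sql
-- Hypothetical scratch table holding the sample rows from the question.
create table customers_demo (
  customer_id int primary key,
  first_name  varchar(50),
  last_name   varchar(50)
);

insert into customers_demo (customer_id, first_name, last_name) values
  (0, 'Rich',    'Smith'),
  (1, 'Paul',    'Jones'),
  (2, 'Richard', 'Smith'),
  (3, 'Jimmy',   'Roberts');

-- One-way check: a row matches only when c.first_name is contained in
-- dup.first_name, so only the shorter name (Rich, customer_id 0) comes back.
select distinct c.customer_id, concat(c.first_name, ' ', c.last_name) as name
from customers_demo c
inner join customers_demo dup on c.last_name = dup.last_name
where dup.first_name like concat('%', c.first_name, '%')
  and dup.customer_id <> c.customer_id
order by name;

-- Two-way check: either name may contain the other, so both
-- customer_id 0 (Rich) and customer_id 2 (Richard) come back.
select distinct c.customer_id, concat(c.first_name, ' ', c.last_name) as name
from customers_demo c
inner join customers_demo dup on c.last_name = dup.last_name
where (dup.first_name like concat('%', c.first_name, '%')
    or c.first_name like concat('%', dup.first_name, '%'))
  and dup.customer_id <> c.customer_id
order by name;

-- Clean up the scratch table.
drop table customers_demo;
```

A regular table is used rather than a temporary one because MySQL does not allow a temporary table to be referenced twice in the same query, which a self-join would do.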
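Note that using LIKE in the query has performance implications. I am not an expert in this, so take it as just a thought, but a pattern that starts with % generally cannot use an index on first_name, so the comparison has to run for every pair of rows that share a last name.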
EDIT:
As others mentioned in the comments, this will only catch cases where the "shortened" version is really just a substring. It won't catch cases like Michael -> Mike or William -> Bill, and on the other hand John and some guy named Johnson might be two totally different people.
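Both limits can be reproduced with a few made-up rows. This is only a sketch in MySQL syntax; the table name name_limits_demo and the sample people are hypothetical and not taken from the question.

```sql
-- Hypothetical scratch table with names chosen to show both failure modes.
create table name_limits_demo (
  customer_id int primary key,
  first_name  varchar(50),
  last_name   varchar(50)
);

insert into name_limits_demo (customer_id, first_name, last_name) values
  (10, 'John',    'Doe'),
  (11, 'Johnson', 'Doe'),
  (12, 'Michael', 'Lee'),
  (13, 'Mike',    'Lee');

-- Same two-way substring check as in the final query.
-- John and Johnson are reported as a match, although they may be
-- different people (a possible false positive).
-- Michael and Mike are not reported, because neither name contains
-- the other (a missed duplicate).
select distinct c.customer_id, concat(c.first_name, ' ', c.last_name) as name
from name_limits_demo c
inner join name_limits_demo dup on c.last_name = dup.last_name
where (dup.first_name like concat('%', c.first_name, '%')
    or c.first_name like concat('%', dup.first_name, '%'))
  and dup.customer_id <> c.customer_id
order by name;

-- Clean up the scratch table.
drop table name_limits_demo;
```

So the results of the substring check are probably best treated as candidates to review by hand rather than as confirmed duplicates.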