Although this issue has been brought up in the past, I’m curious if this is still the best way to clean up duplicate entries in a large (3M and growing) table. After each bulk insert I run this line to keep things tidy, but it’s starting to take a very long time to execute.
Duplicate rows can only be identified by 3 columns. The others are auto-increment IDs, unique IDs, sources, etc.
Here’s what I currently have:
DELETE n1
FROM main n1, main n2
WHERE n1.id < n2.id
AND n1.col1 = n2.col1
AND n1.col2 = n2.col2
AND n1.col3 = n2.col3
Any chance I could speed this up, or is this as good as it gets?
Thank you for any help/insight!
Add a unique index to your table on columns col1, col2 and col3.
This will prevent duplicate rows from being inserted into your table.
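A minimal sketch of such an index, assuming MySQL (which the multi-table DELETE syntax in the question suggests) and the table and column names from the question. The index name `uq_main_cols` is made up for illustration. Any existing duplicates have to be removed first, for example with the DELETE from the question, because adding a unique index to a table that still contains duplicates fails with a duplicate-entry error.

```sql
-- Assumes the table `main` and the columns col1, col2, col3 from the question.
-- `uq_main_cols` is a hypothetical index name; pick whatever fits your schema.
ALTER TABLE main
  ADD UNIQUE INDEX uq_main_cols (col1, col2, col3);
```

Note that MySQL treats NULLs as distinct in a unique index, so rows with a NULL in any of the three columns would not be caught by it.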
For example, once you have inserted a row with a given combination of col1, col2 and col3, you can’t insert another row with the same three values; you will get a duplicate-entry error.
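The behaviour can be sketched as follows. The values are purely illustrative, and the statements assume that the unique index on (col1, col2, col3) already exists and that the remaining columns of `main` are auto-increment or have defaults.

```sql
-- Illustrative values only; not taken from the original post.
INSERT INTO main (col1, col2, col3) VALUES ('a', 'b', 'c');  -- succeeds

-- Same (col1, col2, col3) combination: rejected by the unique index
-- with a duplicate-entry error (MySQL error 1062).
INSERT INTO main (col1, col2, col3) VALUES ('a', 'b', 'c');

-- For bulk loads, INSERT IGNORE skips the conflicting rows with a warning
-- instead of aborting the whole statement: here only ('a', 'b', 'd') is added.
INSERT IGNORE INTO main (col1, col2, col3)
VALUES ('a', 'b', 'c'), ('a', 'b', 'd');
```

Using `INSERT IGNORE` for the bulk insert is one possible way to drop the cleanup DELETE entirely, since duplicates never reach the table in the first place.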
With the correct unique indexes in place, you don’t have to worry about duplicate records later.