First, I'd like to note that (as a newbie) I did search through several Q&As about duplicates in a table, but unfortunately I couldn't adapt the code given in those answers to my case.
My table is made out of a report being sorted in SQL Server 2008.
I would like to know how to remove duplicate records, with an explanation of how the solution works.
"MyTable":
Column1 (PK-auto incremental table's record ID)
Column2 (some TXT)
Column3 (Some TXT)
Column4 (SmallDateTime)
Column5 (currently empty; it will hold the number of rows in each duplicate group, i.e. the deleted duplicates plus the surviving row)
The key point in my case is that rows count as duplicates when [column2 and column3] have the same content, even though they don't always share the same date (column4).
From this:
col1  col2   col3  col4         col5
----  -----  ----  -----------  ----
1     [abc]  [4]   [10/1/2012]  null
2     [abc]  [1]   [12/1/2012]  null
3     [ghi]  [6]   [4/1/2012]   null
4     [def]  [5]   [8/1/2012]   null
5     [abc]  [4]   [10/1/2012]  null
6     [def]  [5]   [12/1/2012]  null
7     [ghi]  [6]   [15/1/2012]  null
8     [abc]  [4]   [17/1/2012]  null
9     [ghi]  [6]   [6/1/2012]   null
10    [abc]  [1]   [13/1/2012]  null
Into this:
col1  col2   col3  col4         col5
----  -----  ----  -----------  ----
8     [abc]  [4]   [17/1/2012]  3
10    [abc]  [1]   [13/1/2012]  2
6     [def]  [5]   [12/1/2012]  2
7     [ghi]  [6]   [15/1/2012]  3
In other words, keep only the latest row as the representative of each group of duplicates.
++ReEditing++
Aaron Bertrand
shawnt00
e2nburner… and the rest of you all
I can't say how much I appreciate your replies, although I haven't yet fully understood that mass of code.
I'm now going to check that code, but not before thanking you guys!
When I first started to program and needed SQL queries, after using
Select * From MyTable
… my first SQL statement …
I said, HEY, I know SQL! … Now … look at the deep knowledge you guys have … THANKS A LOT. I know this post on Stack Overflow will be useful for other beginners too.
This answer uses a common table expression to apply row_number() and count() to each “slice” of data (meaning grouped by col2 + col3). The count() is used to identify how many rows belong to each such group, and the row_number() is used to apply a “rank” ordered by col4 desc (1 = latest per group, 2 = 2nd latest, etc). This also uses col1 (which looks like a unique column) to break any ties. The CTE can be followed by a query such as a select, update, delete, etc. So you can run the first select to validate that these are the rows you want to keep, and that the counts are correct. If they are, then you can proceed with the updates and deletes. You’ll notice that in all cases the row_number() output is used to identify the rows you keep or the rows you discard.
To identify the rows you want to keep:
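The query itself is not shown here, so the following is only a sketch of what the description implies. It assumes the table is dbo.MyTable with columns named col1 through col5 as in the sample data; the dbo schema and the names x, rn, and c are illustrative choices, not taken from the question.

```sql
-- Sketch only: assumes dbo.MyTable(col1..col5) as in the sample data.
-- If other statements precede this one in the batch, the previous
-- statement must end with a semicolon before WITH.
WITH x AS
(
    SELECT col1, col2, col3, col4, col5,
           -- 1 = latest row per (col2, col3) group; col1 breaks ties on col4
           ROW_NUMBER() OVER (PARTITION BY col2, col3
                              ORDER BY col4 DESC, col1 DESC) AS rn,
           -- total number of rows in the same (col2, col3) group
           COUNT(*) OVER (PARTITION BY col2, col3) AS c
    FROM dbo.MyTable
)
SELECT col1, col2, col3, col4, c AS col5
FROM x
WHERE rn = 1;  -- only the survivor of each group
```

Nothing is modified by this statement, so it can be run repeatedly while checking that the surviving rows and their counts look right.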
Once you've confirmed that those are the rows you want to keep, you can update them like this:
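A sketch of that update, under the same assumptions (dbo.MyTable with col1 through col5; x, rn, and c are illustrative names). SQL Server allows an UPDATE to target the CTE directly when the updated column comes from a single base table, which is the case for col5 here.

```sql
-- Sketch only: writes the group size into col5 of each surviving row.
WITH x AS
(
    SELECT col5,
           ROW_NUMBER() OVER (PARTITION BY col2, col3
                              ORDER BY col4 DESC, col1 DESC) AS rn,
           COUNT(*) OVER (PARTITION BY col2, col3) AS c
    FROM dbo.MyTable
)
UPDATE x
SET col5 = c      -- number of rows in the group, survivor included
WHERE rn = 1;     -- only the latest row of each group is touched
```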
Then delete the remainders this way:
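A sketch of the delete, again assuming dbo.MyTable and the illustrative names x and rn. The ordering must match the one used for the update so that the same row in each group is treated as the survivor.

```sql
-- Sketch only: removes every row that is not the latest in its group.
WITH x AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY col2, col3
                              ORDER BY col4 DESC, col1 DESC) AS rn
    FROM dbo.MyTable
)
DELETE FROM x
WHERE rn > 1;  -- rn = 1 is the survivor; everything else is a duplicate
```

Wrapping the update and the delete in a single transaction is one way to make sure they are applied together or not at all.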
Or even more simply (assuming col5 was completely null before the update):
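A sketch of that shortcut, assuming dbo.MyTable as above. It relies on the update having already run, so that the only rows with a non-null col5 are the survivors.

```sql
-- Sketch only: valid only if col5 was NULL everywhere before the update
-- and the update has already set col5 on each surviving row.
DELETE FROM dbo.MyTable
WHERE col5 IS NULL;  -- rows the update did not touch are the duplicates
```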