I have mistakenly loaded duplicate files into a database table (IBM DB2 v9.7). I need to delete the duplicate records without deleting valid data.
Initially, I thought of HAVING COUNT(*) > 1 as the solution to my problem, but this will not work: our supplier produces parts with modified specs, so a file may be loaded more than once with valid data.
I know a few things:
- the date range for my duplicate records: between ‘2012-08-27’ and ‘2012-09-02’
- the attributes to use to validate data
This is my SQL code to identify the dupes:
    SELECT CAST(ENDDATE AS DATE) ENDDATE,
           CAST(LOADEDON AS DATE),
           SUBSTR(SITEID,1,20) SITEID,
           SUBSTR(LOCATIONNAME_1,1,20),
           SUBSTR(RID,1,15),
           COUNT(RID)
    FROM AUTOMATION
    WHERE CAST(ENDDATE AS DATE) BETWEEN '2012-08-27' AND '2012-09-02'
    GROUP BY CAST(ENDDATE AS DATE),
             CAST(LOADEDON AS DATE),
             SUBSTR(SITEID,1,20),
             SUBSTR(LOCATIONNAME_1,1,20),
             SUBSTR(RID,1,15)
    ORDER BY 5 ASC
    FOR FETCH ONLY WITH UR
EDIT: the set of columns that can be used to identify a duplicate is RID, LOADEDON and FILENAME (FILENAME is not shown here).
This is a sample of the output:
    08/29/2012  09/05/2012  JGS Memphis  JGS Memphis  029369751671  518
    09/01/2012  09/05/2012  Reynosa      Reynosa      029054883474  521
    08/29/2012  09/05/2012  JGS Memphis  JGS Memphis  028881223425  522
I want to delete all the duplicate records in the timeframe ‘2012-08-27’ AND ‘2012-09-02’ without deleting the records that are loaded N times for legit reasons.
Note: the table does not have a primary key (like Rowid in MS Sqlserver, for instance)
I can’t quite tell which set of columns specifies a duplicate. The following assumes that it is the columns in your sample output:
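A minimal sketch of such a statement, assuming that DB2 for LUW accepts a DELETE against a fullselect, that the five grouping columns from your query define a duplicate, and that it does not matter which copy survives. The names SEQNUM and T are placeholders of mine, not anything in your schema:

```sql
-- Sketch only: assumes DELETE against a fullselect is allowed on this DB2 level
-- and that the PARTITION BY columns below are what makes two rows duplicates.
DELETE FROM (
    SELECT ROW_NUMBER() OVER (
               PARTITION BY CAST(ENDDATE AS DATE),
                            CAST(LOADEDON AS DATE),
                            SITEID,
                            LOCATIONNAME_1,
                            RID
               ORDER BY LOADEDON      -- keeps the earliest-loaded copy
           ) AS SEQNUM
    FROM AUTOMATION
    WHERE CAST(ENDDATE AS DATE) BETWEEN '2012-08-27' AND '2012-09-02'
) AS T
WHERE SEQNUM > 1;                     -- row 1 of each group is never deleted
```

If RID, LOADEDON and FILENAME are the real duplicate key, as your edit suggests, the PARTITION BY list would be those three columns instead.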
This uses row_number() to number the rows within each group of duplicates and deletes every row except the first in each group, guaranteeing that one copy stays in the table.
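Before deleting anything, it may be worth running the same row_number() numbering as a read-only query to see how many rows would go. This is a sketch under the same assumptions about which columns define a duplicate; SEQNUM, T and ROWS_TO_DELETE are names chosen for illustration:

```sql
-- Read-only preview: counts the rows that a "SEQNUM > 1" delete would remove.
SELECT COUNT(*) AS ROWS_TO_DELETE
FROM (
    SELECT ROW_NUMBER() OVER (
               PARTITION BY CAST(ENDDATE AS DATE),
                            CAST(LOADEDON AS DATE),
                            SITEID,
                            LOCATIONNAME_1,
                            RID
               ORDER BY LOADEDON
           ) AS SEQNUM
    FROM AUTOMATION
    WHERE CAST(ENDDATE AS DATE) BETWEEN '2012-08-27' AND '2012-09-02'
) AS T
WHERE SEQNUM > 1
FOR FETCH ONLY WITH UR
```

If the count is far higher than expected, the partition columns are probably too coarse and are catching the legitimate reloads as well.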