I am working on a document management system. Some documents were imported from another system, and due to an error some of them were imported twice. I need to delete the duplicates. I have the document ID from the previous system, but I can’t simply delete by that, because some documents are associated with multiple accounts and are supposed to be in there twice, so I have to check against that as well. The associated values are in different tables. I wrote the following script to come up with the doc IDs to delete, but it is incredibly slow (it has been running for four days on a table with fewer than 2 million records).
declare @docidtodelete int
declare @docid int
declare @sourcedocid varchar(12)
declare @taxid decimal(9,0)
declare @account bigint

select @docid = MIN(d.docid) from DOCS d
    inner join CONTENTS c on d.DOCID = c.DOCID and c.FOLID=1

while @docid is not null
begin
    --get the source document id for this document
    select @sourcedocid = val from VTAB0031 where IDXID=31 and DOCID=@docid

    -- see if there is another document with the same source document id
    select @docidtodelete = isnull(MAX(v.docid),0) from VTAB0031 v
        inner join CONTENTS c on v.DOCID = c.DOCID and c.FOLID=1
        where IDXID=31 and VAL = @sourcedocid

    if @docid<@docidtodelete -- we have a possible duplicate so lets check and see if it matches on account
    begin
        select @account = val from VTAB0002 where IDXID=2 and DOCID=@docid
        select @docidtodelete = isnull(max(v.docid),0) from VTAB0002 v
            where IDXID=2 and VAL = @account and v.DOCID=@docidtodelete

        if @docid<@docidtodelete -- we still have a possible duplicate so lets check and see if it matches on taxid
        begin
            select @taxid = val from VTAB0006 where IDXID=6 and DOCID=@docid
            select @docidtodelete = isnull(max(v.docid),0) from VTAB0006 v
                where IDXID=6 and VAL = @taxid and v.DOCID = @docidtodelete

            if @docid<@docidtodelete -- we still have a match so delete
            begin
                insert into deletedDuplicates values(@docidtodelete ,@docid)
            end
        end
    end

    select @docid = MIN(d.docid) from DOCS d
        inner join CONTENTS c on d.DOCID = c.DOCID and c.FOLID=1
        where d.DOCID > @docid
end
It’s always better to use set operations rather than procedural operations when working with an RDBMS.
Try this instead:
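As a minimal sketch of the set-based idea (an illustration under assumptions, not necessarily the exact query meant here): it assumes each document has at most one row per index in VTAB0031, VTAB0002 and VTAB0006, that the lowest DOCID in each group is the copy to keep, and that deletedDuplicates takes the duplicate's DOCID first and the kept DOCID second, as in the script in the question. Unlike the loop, which only compares each document with the highest DOCID sharing its source id, this flags every copy after the first. The CTE names (keyed, firsts) and column aliases are made up for the example.

```sql
-- Sketch only: one row per document in folder 1, with its three key values.
-- Assumes at most one row per DOCID/IDXID in each VTAB table.
WITH keyed AS (
    SELECT d.DOCID,
           s.VAL AS sourcedocid,
           a.VAL AS account,
           t.VAL AS taxid
    FROM DOCS d
    INNER JOIN VTAB0031 s ON s.DOCID = d.DOCID AND s.IDXID = 31
    INNER JOIN VTAB0002 a ON a.DOCID = d.DOCID AND a.IDXID = 2
    INNER JOIN VTAB0006 t ON t.DOCID = d.DOCID AND t.IDXID = 6
    WHERE EXISTS (SELECT 1 FROM CONTENTS c
                  WHERE c.DOCID = d.DOCID AND c.FOLID = 1)
),
-- Groups that occur more than once, with the copy to keep (lowest DOCID).
firsts AS (
    SELECT sourcedocid, account, taxid, MIN(DOCID) AS keepdocid
    FROM keyed
    GROUP BY sourcedocid, account, taxid
    HAVING COUNT(*) > 1
)
-- Record every later copy together with the document it duplicates.
INSERT INTO deletedDuplicates
SELECT k.DOCID, f.keepdocid
FROM keyed k
INNER JOIN firsts f
    ON  f.sourcedocid = k.sourcedocid
    AND f.account     = k.account
    AND f.taxid       = k.taxid
WHERE k.DOCID > f.keepdocid;
```

Because the whole comparison happens in one statement, the server can pick a join strategy for all rows at once instead of running several lookups per document.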
UPDATE
I made a new query that should get all duplicate records. It should also run more efficiently, since it can use the indexes.
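A query of that kind could look roughly like the following. This is a hedged sketch rather than the exact query: it assumes each document has at most one row per index in VTAB0031, VTAB0002 and VTAB0006, and it treats the lowest DOCID in each (source id, account, tax id) group as the one to keep. The names keyed, ranked, copyno and keepdocid are invented for the example. All joins are equality matches on DOCID and IDXID, so if indexes exist on those columns the optimizer is free to use them.

```sql
-- Sketch only: list every duplicate for review before deleting anything.
WITH keyed AS (
    SELECT d.DOCID,
           s.VAL AS sourcedocid,
           a.VAL AS account,
           t.VAL AS taxid
    FROM DOCS d
    INNER JOIN VTAB0031 s ON s.DOCID = d.DOCID AND s.IDXID = 31
    INNER JOIN VTAB0002 a ON a.DOCID = d.DOCID AND a.IDXID = 2
    INNER JOIN VTAB0006 t ON t.DOCID = d.DOCID AND t.IDXID = 6
    WHERE EXISTS (SELECT 1 FROM CONTENTS c
                  WHERE c.DOCID = d.DOCID AND c.FOLID = 1)
),
ranked AS (
    SELECT DOCID, sourcedocid, account, taxid,
           -- 1 for the first copy in each group, 2, 3, ... for later copies
           ROW_NUMBER() OVER (PARTITION BY sourcedocid, account, taxid
                              ORDER BY DOCID) AS copyno,
           -- the DOCID that would be kept for this group
           MIN(DOCID) OVER (PARTITION BY sourcedocid, account, taxid) AS keepdocid
    FROM keyed
)
SELECT DOCID AS duplicatedocid, keepdocid, sourcedocid, account, taxid
FROM ranked
WHERE copyno > 1
ORDER BY sourcedocid, DOCID;
```

Running it as a plain SELECT first makes it possible to spot-check a few groups by hand before anything is removed.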