I am working with SQL Server 2000. I have reached the point where I can remove all of the unwanted duplicates based on a complicated set of criteria, but the query now takes hours to complete, whereas it used to take only about 3.5 minutes to return the data with the duplicates included.
For clarity:
I can have a duplicate rpt.Name value as long as either the rpt.HostName or the rpt.SystemSerialNumber field is also different. I also have to determine which entry to keep based on the time stamps in four different columns, because some of those columns have missing time stamps.
Any help is greatly appreciated!
SELECT
    rpt.[Name],
    rpt.LastAgentExecution,
    rpt.GroupName,
    rpt.PackageName,
    rpt.PackageVersion,
    rpt.ProcedureName,
    rpt.HostName,
    rpt.SystemSerialNumber,
    rpt.JobCreationTime,
    rpt.JobActivationTime,
    rpt.[Job Completion Time]
FROM DSM_StandardGroupMembersProcedureActivityViewExt rpt
WHERE
    (
        (
            rpt.GroupName = 'Adobe Acrobat 7 Deploy'
            OR rpt.GroupName = 'Adobe Acrobat 8 Deploy'
        )
        AND
        (
            (rpt.PackageName = 'Adobe Acrobat 7' AND rpt.PackageVersion = '-1.0')
            OR (rpt.PackageName = 'Adobe Acrobat 8' AND rpt.PackageVersion = '-3.0')
        )
    )
    AND NOT EXISTS
    (
        SELECT *
        FROM DSM_StandardGroupMembersProcedureActivityViewExt rpt_dupe
        WHERE
            (
                (
                    rpt.GroupName = 'Adobe Acrobat 7 Deploy'
                    OR rpt.GroupName = 'Adobe Acrobat 8 Deploy'
                )
                AND
                (
                    (rpt.PackageName = 'Adobe Acrobat 7' AND rpt.PackageVersion = '-1.0')
                    OR (rpt.PackageName = 'Adobe Acrobat 8' AND rpt.PackageVersion = '-3.0')
                )
                AND
                (
                    (rpt_dupe.[Name] = rpt.[Name])
                    AND
                    (
                        (rpt_dupe.SystemSerialNumber = rpt.SystemSerialNumber)
                        OR (rpt_dupe.HostName = rpt.HostName)
                    )
                    AND
                    (
                        (rpt_dupe.LastAgentExecution < rpt.LastAgentExecution)
                        OR (rpt_dupe.JobActivationTime < rpt.JobActivationTime)
                        OR (rpt_dupe.JobCreationTime < rpt.JobCreationTime)
                        OR (rpt_dupe.[Job Completion Time] < rpt.[Job Completion Time])
                    )
                )
            )
    )
The slowdown comes from the NOT EXISTS clause.
One suggestion is to rewrite this as a left outer join.
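A minimal sketch of that left outer join rewrite, reusing the table and column names from the question. It keeps the same matching conditions as the NOT EXISTS subquery and then keeps only the rows for which no "earlier" duplicate was found. I have not run it against your data, so treat it as a starting point rather than a drop-in replacement:

```sql
-- Sketch only: the ON clause mirrors the correlation conditions
-- of the NOT EXISTS subquery in the question.
SELECT
    rpt.[Name],
    rpt.LastAgentExecution,
    rpt.GroupName,
    rpt.PackageName,
    rpt.PackageVersion,
    rpt.ProcedureName,
    rpt.HostName,
    rpt.SystemSerialNumber,
    rpt.JobCreationTime,
    rpt.JobActivationTime,
    rpt.[Job Completion Time]
FROM DSM_StandardGroupMembersProcedureActivityViewExt rpt
LEFT OUTER JOIN DSM_StandardGroupMembersProcedureActivityViewExt rpt_dupe
    ON  rpt_dupe.[Name] = rpt.[Name]
    AND (
            rpt_dupe.SystemSerialNumber = rpt.SystemSerialNumber
         OR rpt_dupe.HostName = rpt.HostName
        )
    AND (
            rpt_dupe.LastAgentExecution < rpt.LastAgentExecution
         OR rpt_dupe.JobActivationTime < rpt.JobActivationTime
         OR rpt_dupe.JobCreationTime < rpt.JobCreationTime
         OR rpt_dupe.[Job Completion Time] < rpt.[Job Completion Time]
        )
WHERE
    (
        rpt.GroupName = 'Adobe Acrobat 7 Deploy'
        OR rpt.GroupName = 'Adobe Acrobat 8 Deploy'
    )
    AND
    (
        (rpt.PackageName = 'Adobe Acrobat 7' AND rpt.PackageVersion = '-1.0')
        OR (rpt.PackageName = 'Adobe Acrobat 8' AND rpt.PackageVersion = '-3.0')
    )
    -- A matched row always has a non-NULL Name (the ON clause requires equality),
    -- so NULL here means no duplicate row was found.
    AND rpt_dupe.[Name] IS NULL
```

Repeating the GroupName and PackageName filters on rpt_dupe in the ON clause might shrink the join further, but that changes which rows count as duplicates, so check it against your requirements first.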
I’ve found that NOT EXISTS and NOT IN often optimize poorly.
Another suggestion is to change this query to a more direct implementation:
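A rough sketch of what such a direct version might look like. It assumes a hypothetical unique "id" column, which the view in the question does not show, and it assumes that a duplicate means the same Name, HostName and SystemSerialNumber. That is one reading of the description and is stricter than the OR test in the original subquery:

```sql
-- Sketch only: "id" is a hypothetical unique key, not a column shown in the question.
SELECT
    rpt.[Name],
    rpt.LastAgentExecution,
    rpt.GroupName,
    rpt.PackageName,
    rpt.PackageVersion,
    rpt.ProcedureName,
    rpt.HostName,
    rpt.SystemSerialNumber,
    rpt.JobCreationTime,
    rpt.JobActivationTime,
    rpt.[Job Completion Time]
FROM DSM_StandardGroupMembersProcedureActivityViewExt rpt
INNER JOIN
(
    -- One id per combination of the columns that should be distinct.
    SELECT MIN(id) AS keep_id
    FROM DSM_StandardGroupMembersProcedureActivityViewExt
    WHERE
        (
            GroupName = 'Adobe Acrobat 7 Deploy'
            OR GroupName = 'Adobe Acrobat 8 Deploy'
        )
        AND
        (
            (PackageName = 'Adobe Acrobat 7' AND PackageVersion = '-1.0')
            OR (PackageName = 'Adobe Acrobat 8' AND PackageVersion = '-3.0')
        )
    GROUP BY [Name], HostName, SystemSerialNumber
) keepers
    ON keepers.keep_id = rpt.id
```

MIN(id) simply picks one row per group. It does not look at the four time stamp columns, so choosing the row by time stamp would need extra logic on top of this.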
That is, summarize the table by the columns that you want to be distinct, and choose one of the rows in each group. Here, I assume there is an “id” field that uniquely identifies each row. If there is not, you might have to use a combination of fields, such as name and date, which makes this more challenging. In more recent versions of SQL Server (2005 and later), you can use ROW_NUMBER().
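For later versions, the ROW_NUMBER() approach could look roughly like the sketch below. It will not run on SQL Server 2000. The PARTITION BY columns and the COALESCE ordering (use the first time stamp that is not missing, earliest first) are my assumptions about how a duplicate is defined and which row should be kept, so adjust them to your actual rules:

```sql
-- Requires SQL Server 2005 or later; ROW_NUMBER() does not exist in SQL Server 2000.
SELECT
    ranked.[Name],
    ranked.LastAgentExecution,
    ranked.GroupName,
    ranked.PackageName,
    ranked.PackageVersion,
    ranked.ProcedureName,
    ranked.HostName,
    ranked.SystemSerialNumber,
    ranked.JobCreationTime,
    ranked.JobActivationTime,
    ranked.[Job Completion Time]
FROM
(
    SELECT
        rpt.*,
        -- Number the rows within each group of duplicates.
        -- Assumption: order by the first time stamp that is not NULL.
        ROW_NUMBER() OVER (
            PARTITION BY rpt.[Name], rpt.HostName, rpt.SystemSerialNumber
            ORDER BY COALESCE(rpt.LastAgentExecution,
                              rpt.JobActivationTime,
                              rpt.JobCreationTime,
                              rpt.[Job Completion Time])
        ) AS rn
    FROM DSM_StandardGroupMembersProcedureActivityViewExt rpt
    WHERE
        (
            rpt.GroupName = 'Adobe Acrobat 7 Deploy'
            OR rpt.GroupName = 'Adobe Acrobat 8 Deploy'
        )
        AND
        (
            (rpt.PackageName = 'Adobe Acrobat 7' AND rpt.PackageVersion = '-1.0')
            OR (rpt.PackageName = 'Adobe Acrobat 8' AND rpt.PackageVersion = '-3.0')
        )
) ranked
WHERE ranked.rn = 1   -- keep the first row of each group
```

This reads the view once instead of joining it to itself, which is usually why it performs better than the self-join variants.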