I have a set of stored procedures. Each stored procedure supposedly keeps a specific database table in sync with an identical one in another database.
The database tables have up to hundreds of millions of records. I need to find the quickest way to validate that these procedures are really keeping everything in sync, and I need to be able to locate records which vary between the two tables for each procedure (for debugging purposes).
I was told that the following approach (found somewhere on SO, I believe, but I no longer have the link since it was a while back):
INSERT INTO target_table (columns)
SELECT columns FROM table1
EXCEPT
SELECT columns FROM table2;

INSERT INTO target_table (columns)
SELECT columns FROM table2
EXCEPT
SELECT columns FROM table1;
wouldn’t work fast enough. Can anyone suggest another way to do this that would be faster, either using T-SQL procedures or even external C# code? (I thought C# code might let me store PKs for hashing purposes, so I could at least track the primary keys and find which were superfluous or missing even if I didn’t track the rest of the fields.)
It is fairly difficult to do this, but you can get some mileage out of checksums. One approach is to split the key range into several subranges that can be verified (a) in parallel and/or (b) at different scheduled intervals, computing a checksum over each subrange in both tables and comparing the results.
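As a rough sketch of the subrange check described above, the following compares one key range between the two databases. The names `SourceDb`, `TargetDb`, `dbo.Orders`, and the integer key column `Id` are placeholders, not taken from the question, and the row-count comparison is an extra safeguard added here:

```sql
-- Hypothetical names: SourceDb / TargetDb, table dbo.Orders, integer key Id.
-- Verifies a single subrange; run one such batch per subrange.
DECLARE @start BIGINT = 1, @end BIGINT = 1000000;
DECLARE @srcCount BIGINT, @tgtCount BIGINT;
DECLARE @srcSum INT, @tgtSum INT;

-- Aggregate checksum and row count for the subrange in the source table.
SELECT @srcCount = COUNT_BIG(*),
       @srcSum   = CHECKSUM_AGG(BINARY_CHECKSUM(*))
FROM SourceDb.dbo.Orders
WHERE Id BETWEEN @start AND @end;

-- Same aggregates for the target table.
SELECT @tgtCount = COUNT_BIG(*),
       @tgtSum   = CHECKSUM_AGG(BINARY_CHECKSUM(*))
FROM TargetDb.dbo.Orders
WHERE Id BETWEEN @start AND @end;

-- NULL-safe comparison, in case the aggregate is NULL for an empty range.
SELECT @start AS RangeStart,
       @end   AS RangeEnd,
       CASE
           WHEN @srcCount = @tgtCount
                AND (@srcSum = @tgtSum
                     OR (@srcSum IS NULL AND @tgtSum IS NULL))
               THEN 'match'
           ELSE 'mismatch'
       END AS Result;
```

Each subrange touches only its own slice of the key, so separate ranges can run on separate connections or on different nights.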
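The main issue is that identifying the differences can only be achieved by scanning all the rows, which is very expensive. With ranges, you can submit different subranges for verification on a rotating schedule instead of scanning everything at once. The CHECKSUM_AGG and BINARY_CHECKSUM(*) restrictions apply, of course: BINARY_CHECKSUM ignores columns of noncomparable data types (such as text, ntext, image, and xml), and a matching checksum makes a difference unlikely but does not strictly rule one out.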
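Once a subrange reports a mismatch, the differing rows inside it still have to be located for debugging. One possible way, sketched here with the placeholder names `SourceDb`, `TargetDb`, `dbo.Orders`, and key column `Id` (none of which come from the question), is to compare per-row checksums over just that subrange:

```sql
-- Hypothetical names: SourceDb / TargetDb, table dbo.Orders, integer key Id.
-- Lists keys that are missing, extra, or different within one subrange.
DECLARE @start BIGINT = 1, @end BIGINT = 1000000;

SELECT COALESCE(s.Id, t.Id) AS Id,
       CASE
           WHEN t.Id IS NULL THEN 'missing in target'
           WHEN s.Id IS NULL THEN 'extra in target'
           ELSE 'different'
       END AS Issue
FROM (SELECT Id, BINARY_CHECKSUM(*) AS RowSum
      FROM SourceDb.dbo.Orders
      WHERE Id BETWEEN @start AND @end) AS s
FULL OUTER JOIN
     (SELECT Id, BINARY_CHECKSUM(*) AS RowSum
      FROM TargetDb.dbo.Orders
      WHERE Id BETWEEN @start AND @end) AS t
    ON s.Id = t.Id
WHERE s.Id IS NULL          -- key exists only in target
   OR t.Id IS NULL          -- key exists only in source
   OR s.RowSum <> t.RowSum; -- key exists in both, contents differ
```

Because this runs only on ranges already known to differ, the expensive row-by-row comparison stays limited to a small part of the table. Rows whose checksums collide would be missed, so a final column-by-column check on suspect ranges may still be worthwhile.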