I have a scenario where I need to check if rows in a target database need updating from a source database. The source data is actually a view and data from that view gets pumped into a destination table. Because the source view collects/rolls-up/pivots data from several underlying tables we don’t really have a good way to change the schema to support change tracking, so my thought was to compute a hash of each row’s data and include that as part of the view. We can then compare the hash value in the destination table to see if there’s a difference and update accordingly.
I’m aware of the:
CHECKSUM
BINARY_CHECKSUM
HASHBYTES
functions. Either CHECKSUM() or BINARY_CHECKSUM() seems to be the best option but I’m not sure how well it will perform over a view with 50 columns and a million+ rows. I’m also aware that the checksums/hashes generated may not be different even after an edit, but that’s tolerable in this case.
So the question: Is the hash/checksum approach a good way to do this and if so what’s the best function to use? Or is there another, better way entirely to approach the problem?
(Oh, running on SQL Server 2005 now but we’ll soon be moving to 2008R2, if that helps.)
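To make the idea concrete, here is a rough sketch of the hash-in-the-view approach using HASHBYTES. All object and column names (dbo.SourceView, dbo.SourceViewWithHash, dbo.DestTable, Id, ColA, ColB, RowHash) are made up for illustration, and it assumes both objects are reachable from one database; a real cross-database setup would need three-part names. It is only an outline of the technique, not a tuned implementation.

```sql
-- Hypothetical objects: dbo.SourceView (existing view), dbo.DestTable (target),
-- key column Id, data columns ColA and ColB, hash column RowHash.

-- Step 1: expose a per-row hash alongside the data.
CREATE VIEW dbo.SourceViewWithHash
AS
SELECT
    s.Id,
    s.ColA,
    s.ColB,
    -- The '|' delimiter keeps ('ab','c') and ('a','bc') from hashing alike.
    -- ISNULL stops a single NULL column from turning the whole hash NULL,
    -- at the cost of treating NULL and '' as the same value.
    HASHBYTES('SHA1',
        ISNULL(CAST(s.ColA AS nvarchar(100)), N'') + N'|' +
        ISNULL(CAST(s.ColB AS nvarchar(100)), N'')) AS RowHash
FROM dbo.SourceView AS s;
GO

-- Step 2: refresh rows whose stored hash no longer matches.
UPDATE d
SET d.ColA    = v.ColA,
    d.ColB    = v.ColB,
    d.RowHash = v.RowHash
FROM dbo.DestTable AS d
JOIN dbo.SourceViewWithHash AS v
    ON v.Id = d.Id
WHERE d.RowHash <> v.RowHash
   OR d.RowHash IS NULL;

-- Step 3: add rows that do not exist in the destination yet.
INSERT INTO dbo.DestTable (Id, ColA, ColB, RowHash)
SELECT v.Id, v.ColA, v.ColB, v.RowHash
FROM dbo.SourceViewWithHash AS v
WHERE NOT EXISTS (SELECT 1 FROM dbo.DestTable AS d WHERE d.Id = v.Id);
```

One thing to watch with 50 columns: on SQL Server 2005 and 2008 R2 the input to HASHBYTES is limited to 8000 bytes, so a wide row concatenated as nvarchar could exceed it.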
I don’t know that I would trust CHECKSUM, actually. I’ve seen many cases where people documented that two different rows produced a collision. Do you just want to know that a row has changed (or doesn’t exist in the destination yet)? Have you discarded the possibility of using ROWVERSION? Are you potentially updating data in both places?

Since you are moving to SQL Server 2008 R2 soon, have you thought about other methods that already exist, such as Change Tracking or Change Data Capture? (Comparison here.)

There are also other ways to solve this problem that don’t involve caring which rows have changed, but this depends on your end goal. In an old system I worked with, we would push out primary data changes en masse into a separate schema, then play switcheroo when the data had arrived. Of course all the data was updated in the source, and it was ok for the destination to be minutes behind. But it prevented the hassle of figuring out deltas between the source and destination.
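One way such a switcheroo could look, as a sketch only: it assumes three schemas named live, stage, and old already exist, and that a full fresh copy has been loaded into stage.Report. Those names are hypothetical, and ALTER SCHEMA ... TRANSFER is one possible mechanism for the swap, not necessarily the one that old system used.

```sql
-- Hypothetical layout:
--   live.Report   table that readers query
--   stage.Report  freshly loaded copy of the source data
--   old           empty holding schema used only during the swap

BEGIN TRANSACTION;
    -- Move the current table out of the way.
    ALTER SCHEMA old TRANSFER live.Report;
    -- Promote the freshly loaded copy.
    ALTER SCHEMA live TRANSFER stage.Report;
    -- Park the previous copy in stage, ready for the next load.
    ALTER SCHEMA stage TRANSFER old.Report;
COMMIT TRANSACTION;
```

The swap is a metadata change, so readers see either the old copy or the new one, and nothing has to work out which individual rows differ.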