I am integrating between 4 data sources:
InternalDeviceRepository
ExternalDeviceRepository
NightlyDeviceDeltas
MidDayDeviceDeltas
Changes flow into the InternalDeviceRepository from the other three sources.
All sources are eventually transformed to have the following definition:
FIELDS
=============
IdentityField
Contract
ContractLevel
StartDate
EndDate
ContractStatus
Location
IdentityField is the primary key. Contract is a secondary key, used only if a match exists; otherwise a new record needs to be created.
Currently I compare all the fields in a WHERE clause in SQL statements and also in a number of places in SSIS packages. This makes for some messy-looking SQL and SSIS packages.
I’ve been mulling over computing a hash of ContractLevel, StartDate, EndDate, ContractStatus, and Location and adding that to each of the input tables. This would allow me to use a single value for comparison, instead of 5 separate ones each time.
I’ve never done this before, nor have I seen it done. Is there a reason that it should (or should not) be used, or is there a cleaner way to do it?
It is a valid approach. Consider introducing a calculated (computed) column that holds the hash, and putting an index on it.
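As a rough sketch of the computed-column idea, assuming SQL Server, that the tables are named after the sources in the question, and that ContractLevel, ContractStatus, and Location are varchar while StartDate and EndDate are date or datetime (none of this is stated in the question, and the names RowHash and IX_InternalDeviceRepository_RowHash are made up for illustration):

```sql
-- Assumed schema: varchar text columns, date/datetime date columns.
-- The '|' delimiter keeps ('ab','c') from hashing the same as ('a','bc').
-- Note that NULL and '' are treated as equal by this expression.
ALTER TABLE dbo.InternalDeviceRepository
ADD RowHash AS CAST(HASHBYTES('MD5',
        ISNULL(ContractLevel, '')                    + '|' +
        ISNULL(CONVERT(char(8), StartDate, 112), '') + '|' +
        ISNULL(CONVERT(char(8), EndDate, 112), '')   + '|' +
        ISNULL(ContractStatus, '')                   + '|' +
        ISNULL(Location, '')) AS binary(16)) PERSISTED;
GO

-- Index the hash so comparisons on it can seek instead of scan.
CREATE INDEX IX_InternalDeviceRepository_RowHash
    ON dbo.InternalDeviceRepository (RowHash);
GO

-- Rows whose tracked fields differ between a delta table and the repository,
-- assuming the same RowHash column has been added to the delta table.
SELECT d.IdentityField
FROM dbo.NightlyDeviceDeltas AS d
JOIN dbo.InternalDeviceRepository AS r
  ON r.IdentityField = d.IdentityField
WHERE r.RowHash <> d.RowHash;
```

Equal hashes do not strictly guarantee equal rows, so where a false match would be costly, the hash comparison can be followed by a field-by-field check on the rows it selects.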
You may use either the CHECKSUM function or write your own hash function like this:
The result is a 16-byte value: you may take all 16 bytes as a Guid, or only the first 8 bytes as a bigint, and compare on that.
Adapt the function to your needs, for example to accept a string as a parameter, or even all of your fields, instead of varbinary.
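The function itself is not shown above, so the following is only a minimal sketch of what such a helper might look like, assuming it wraps HASHBYTES with MD5 (which matches the 16-byte result described). The name dbo.fn_RowHash and the sample input string are hypothetical:

```sql
-- Hypothetical helper: returns a 16-byte MD5 hash of the input bytes.
CREATE FUNCTION dbo.fn_RowHash (@data varbinary(8000))
RETURNS binary(16)
WITH SCHEMABINDING
AS
BEGIN
    RETURN HASHBYTES('MD5', @data);
END;
GO

-- Hash a delimited string built from the compared fields (sample values).
DECLARE @h binary(16) =
    dbo.fn_RowHash(CAST('Gold|20240101|20241231|Active|Berlin' AS varbinary(8000)));

-- Use all 16 bytes as a uniqueidentifier, or only the first 8 as a bigint.
SELECT CAST(@h AS uniqueidentifier)        AS HashAsGuid,
       CAST(SUBSTRING(@h, 1, 8) AS bigint) AS HashAsBigint;
```

Keeping only 8 bytes makes the column smaller and cheaper to index, at the cost of a higher chance of two different rows producing the same value.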
BUT