I’m working (experimentally) on a project where I have to merge data from several data sets into a single SQL Server 2012 database. Some data is duplicated across these sets, and I’m working on a way to detect and remove the duplicates. My current test hashes each data item and checks for duplicate hashes. This seems to work really well so far (if there are hash collisions, it isn’t the end of the world).
I’m storing this hash in the database as a ‘binary(32)’ and whenever I need to insert a new row (I’m actually using a MERGE), I look for the hash value and only insert if it isn’t found. I have an index on the hash column to aid this search.
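To make the approach concrete, here is a minimal Python sketch of the hash-then-check idea. It is an illustration only, not the actual MERGE statement: the choice of SHA-256 (which happens to produce 32 bytes, matching binary(32)), the field normalisation, and the names `row_hash` and `merge_rows` are all assumptions, and an in-memory set stands in for the indexed hash column.

```python
import hashlib

def row_hash(fields):
    """Return a 32-byte SHA-256 digest of a row's fields (fits binary(32))."""
    # Normalise each field and join with a separator that should not occur in
    # the data, so ("ab", "c") and ("a", "bc") hash differently.
    # The strip/lower normalisation is an assumption about what counts as a duplicate.
    canonical = "\x1f".join(str(f).strip().lower() for f in fields)
    return hashlib.sha256(canonical.encode("utf-8")).digest()

def merge_rows(existing_hashes, rows):
    """Return only the rows whose hash is not already known.

    existing_hashes is a set standing in for the indexed hash column.
    """
    inserted = []
    for row in rows:
        h = row_hash(row)
        if h not in existing_hashes:
            existing_hashes.add(h)
            inserted.append(row)
    return inserted

seen = set()
batch = [
    ("Alice", "alice@example.com"),
    ("alice ", "ALICE@example.com"),  # same item after normalisation
    ("Bob", "bob@example.com"),
]
new_rows = merge_rows(seen, batch)
print(len(new_rows), len(row_hash(batch[0])))  # prints: 2 32
```

The second row is dropped because it normalises to the same string as the first, so both produce the same digest.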
The problem I’m having is that the index is always extremely fragmented, and I’m sure this must be slowing things down unnecessarily. I assume this is due to the near-randomness of the binary data.
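The suspicion about randomness can be checked with a toy model. The Python sketch below is not how SQL Server stores pages; it assumes a fixed number of rows per page, a simple split-in-half rule, and that every new page is allocated at the physical end of the file. Fragmentation is measured as the fraction of logically adjacent leaf pages that are not physically adjacent. The name `logical_fragmentation` and the figures used are hypothetical.

```python
import bisect
import random

PAGE_CAPACITY = 50  # rows per leaf page; an arbitrary figure for the toy model

def logical_fragmentation(keys):
    """Insert keys one by one; return the fraction of out-of-order leaf pages."""
    pages = [[]]    # leaf pages in logical (key) order, each a sorted list
    physical = [0]  # physical allocation number of each logical page
    for key in keys:
        # The first key of every page after the first acts as a separator.
        separators = [p[0] for p in pages[1:]]
        i = bisect.bisect_right(separators, key)
        bisect.insort(pages[i], key)
        if len(pages[i]) > PAGE_CAPACITY:
            # Page split: the upper half moves to a newly allocated page,
            # which always lands at the physical end of the file.
            mid = len(pages[i]) // 2
            upper = pages[i][mid:]
            pages[i] = pages[i][:mid]
            pages.insert(i + 1, upper)
            physical.insert(i + 1, len(physical))
    out_of_order = sum(1 for a, b in zip(physical, physical[1:]) if b != a + 1)
    return out_of_order / max(len(physical) - 1, 1)

rng = random.Random(42)
hash_like_keys = [rng.getrandbits(256) for _ in range(5000)]  # random, like hashes
sequential_keys = list(range(5000))                           # ever-increasing keys

frag_random = logical_fragmentation(hash_like_keys)
frag_sequential = logical_fragmentation(sequential_keys)
print("random keys:    ", frag_random)
print("sequential keys:", frag_sequential)
```

In this model, ever-increasing keys only ever split the last page, so logical and physical order stay aligned, while random keys split pages all over the key range and each split places a logically middle page at the physical end.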
Are there any index options I could be using to limit this fragmentation? At the moment I’m just using the defaults. Any clues would be appreciated.
Thanks in advance.
No answers unfortunately, but I did find that rebuilding the index periodically during the insertion phase helped, though it obviously came with additional overhead, so it wasn’t particularly worth it. I suspect experimenting with the fill factor may also help, but I haven’t had time to investigate this fully.
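For what it’s worth, the fill factor idea can be explored with a rough simulation before trying it on the real index. The Python sketch below is a toy model under stated assumptions, not a measurement of SQL Server: pages hold a fixed 100 rows, a rebuild packs the existing keys to the given fill factor percentage, and a full page splits in half. The function name `splits_after_rebuild` and the row counts are hypothetical.

```python
import bisect
import random

PAGE_CAPACITY = 100  # rows per leaf page in this toy model

def splits_after_rebuild(existing, incoming, fill_factor):
    """Pack existing keys at fill_factor percent, insert incoming, count splits."""
    # A rebuild leaves each page only fill_factor percent full.
    per_page = max(1, PAGE_CAPACITY * fill_factor // 100)
    ordered = sorted(existing)
    pages = [ordered[i:i + per_page] for i in range(0, len(ordered), per_page)]
    splits = 0
    for key in incoming:
        # The first key of every page after the first acts as a separator.
        separators = [p[0] for p in pages[1:]]
        i = bisect.bisect_right(separators, key)
        bisect.insort(pages[i], key)
        if len(pages[i]) > PAGE_CAPACITY:
            # The page overflowed: split it into two halves.
            mid = len(pages[i]) // 2
            pages[i:i + 1] = [pages[i][:mid], pages[i][mid:]]
            splits += 1
    return splits

rng = random.Random(1)
existing = [rng.getrandbits(256) for _ in range(10_000)]  # rows already indexed
incoming = [rng.getrandbits(256) for _ in range(1_000)]   # next batch of hashes

results = {ff: splits_after_rebuild(existing, incoming, ff) for ff in (100, 90, 70)}
for ff, count in results.items():
    print(f"fill factor {ff}: {count} page splits")
```

The trade-off in this model is that a lower fill factor leaves free space for random inserts to land in, at the cost of more pages to store and scan. In SQL Server the fill factor only takes effect when the index is created or rebuilt, so the free space is used up as rows arrive and a periodic rebuild would still be needed.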