We currently make extensive use of the GetHashCode method, storing the hash codes in a database to track unique items. MSDN has a scary entry about this:
“The default implementation of the GetHashCode method does not guarantee unique return values for different objects. Furthermore, the .NET Framework does not guarantee the default implementation of the GetHashCode method, and the value it returns will be the same between different versions of the .NET Framework. Consequently, the default implementation of this method must not be used as a unique object identifier for hashing purposes.”
We have been using this approach for several years without issue. Should we be worried, and if so what would be a better approach?
To elaborate, the data is coming from an external source. We take two or three string fields, concatenate them into a new string, and then call GetHashCode on the result.
Using a hash code as a unique identifier is a really bad idea, because you’re eventually guaranteed to have collisions if the collection is large enough, and it doesn’t have to be very large before a collision becomes statistically likely. Hash codes are a good, quick way to tell that two objects are different: assuming the same hash function, if they hash to different values, they are definitely different. If they hash to the same value, however, you still need to do an equality comparison to make sure that they really are the same object. At that point you need to compare the properties that make the object unique, i.e., if these properties are the same, then the objects are the same.
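To put a rough number on "statistically likely": GetHashCode returns a 32-bit integer, so there are 2^32 possible values. The sketch below uses the standard birthday-problem approximation and assumes hash values are spread uniformly, which real string hashes only approximate. It is written in Python purely for illustration; the function name is made up and nothing here is .NET API.

```python
import math

HASH_SPACE = 2 ** 32  # number of distinct values a 32-bit hash code can take


def collision_probability(item_count, hash_space=HASH_SPACE):
    """Approximate chance that at least two of item_count items share a hash.

    Uses the birthday-problem approximation 1 - exp(-n(n-1) / (2N)) and
    assumes hashes are uniformly distributed (an idealization).
    """
    if item_count < 2:
        return 0.0
    exponent = item_count * (item_count - 1) / (2 * hash_space)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for small x.
    return -math.expm1(-exponent)


if __name__ == "__main__":
    for n in (1_000, 10_000, 80_000, 500_000):
        print(f"{n:>7} items: {collision_probability(n):.4f}")
```

Under these assumptions the probability of at least one collision passes one half somewhere between 70,000 and 80,000 items, which is small for a table that has been growing for several years.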
I’d suggest using a unique index in the database on the natural key properties, in conjunction with an artificial, autoincrement id as the primary key. Then you can be sure that you don’t get duplicate insertions in the DB (the uniqueness constraint of the index prevents them), and you can quickly compare objects outside the DB by simply checking whether they have the same id, which the primary key constraint also guarantees to be unique.
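A minimal sketch of that layout, using Python's built-in sqlite3 module and an in-memory database just to keep it self-contained. The table name, column names, and helper functions are hypothetical stand-ins for whatever the real natural key fields are; the same idea applies to any database that supports unique indexes.

```python
import sqlite3


def open_store():
    """Create an in-memory table with a surrogate id and a unique natural key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE items (
            id INTEGER PRIMARY KEY AUTOINCREMENT,  -- artificial primary key
            source_name TEXT NOT NULL,             -- hypothetical natural key field
            source_code TEXT NOT NULL,             -- hypothetical natural key field
            UNIQUE (source_name, source_code)      -- rejects duplicate natural keys
        )
        """
    )
    return conn


def get_or_create_id(conn, source_name, source_code):
    """Return the id for this natural key, inserting a row only if it is new."""
    conn.execute(
        "INSERT OR IGNORE INTO items (source_name, source_code) VALUES (?, ?)",
        (source_name, source_code),
    )
    row = conn.execute(
        "SELECT id FROM items WHERE source_name = ? AND source_code = ?",
        (source_name, source_code),
    ).fetchone()
    return row[0]
```

Calling get_or_create_id twice with the same fields returns the same id, and a plain INSERT of an existing key raises sqlite3.IntegrityError. Keeping the fields in separate columns also means that ("ab", "c") and ("a", "bc") stay distinct, whereas concatenating them before hashing would make them identical.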