We have an application that
- Generates a hash code on a string
- Saves that hash code into a DB along with associated data
- Later, it queries the DB using the string hash code for retrieving the data
This is obviously a bug because the value returned from string.GetHashCode() varies across .NET versions and architectures (32/64 bit). To complicate matters, we’re too close to a release to refactor our application to stop serializing hash codes and just query on the strings instead. What we’d like to do is come up with a quick and dirty fix for now, and refactor the code later to do it the right way.
The quick and dirty fix seems to be a static GetInvariantHashCode(string s) helper method that returns the same value on every architecture.
Can anyone suggest an algorithm for generating a hash code on a string that gives the same result on 32-bit and 64-bit architectures?
A few more notes:
- I’m aware that hash codes are not unique. If two different strings produce the same hash code, we post-process the results to find the exact match. The hash code is not used as a primary key.
- I believe the architect’s intent was to speed up the searches by querying on a long instead of an NVarChar.
Then just let the database index the strings for you!
Look, I have no idea how large your domain is, but you’re going to get collisions very rapidly with very high likelihood if it’s of any decent size at all. It’s the birthday problem with a lot of people relative to the number of birthdays. You’re going to have collisions, and lose any gain in speed you might think you’re gaining by not just indexing the strings in the first place.
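To put a rough number on the birthday problem, here is a minimal sketch of the standard approximation, assuming hash codes are spread uniformly over the available values. The class and method names are made up for illustration.

```csharp
using System;

public static class BirthdayEstimate
{
    // Approximate probability that at least two of n distinct strings share
    // a hash code, assuming hashes are uniformly distributed over `buckets`
    // possible values: p ≈ 1 - exp(-n(n-1) / (2 * buckets)).
    public static double CollisionProbability(long n, double buckets)
    {
        return 1.0 - Math.Exp(-(double)n * (n - 1) / (2.0 * buckets));
    }
}
```

With a 32-bit hash code (2^32 buckets), BirthdayEstimate.CollisionProbability(77163, 4294967296.0) comes out at roughly one half, so a table with well under a hundred thousand distinct strings is already more likely than not to contain a collision.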
Anyway, you don’t need us if you’re stuck a few days away from release and you really need a hash code that is invariant across platforms. There are really dumb, really fast implementations of hash code out there that you can use. Hell, you could come up with one yourself in the blink of an eye:
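A minimal sketch of that kind of throwaway hash, as an illustration rather than a recommendation. The method name GetInvariantHashCode is taken from the question; the class name and the multiplier 31 are arbitrary choices. It depends only on the string's UTF-16 code units and wrapping 32-bit integer arithmetic, so it should give the same value on 32-bit and 64-bit runtimes.

```csharp
using System;

public static class InvariantHash
{
    // Multiply-and-add over the UTF-16 code units of the string.
    // Nothing here depends on pointer size or on the runtime's own
    // string.GetHashCode() implementation.
    public static int GetInvariantHashCode(string s)
    {
        if (s == null) throw new ArgumentNullException(nameof(s));

        unchecked // let the 32-bit arithmetic wrap instead of throwing
        {
            int hash = 0;
            foreach (char c in s)
            {
                hash = hash * 31 + c;
            }
            return hash;
        }
    }
}
```

For example, InvariantHash.GetInvariantHashCode("abc") is 96354. Values already stored in the DB from string.GetHashCode() will not match this function, so existing rows would have to be rehashed.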
Or you could use the old Bernstein hash. And on and on. Are they going to give you the performance gain you’re looking for? I don’t know, they weren’t meant to be used for this purpose. They were meant to be used for balancing hash tables. You’re not balancing a hash table. You’re using the wrong concept.
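For reference, a sketch of the Bernstein hash (often called djb2) in its usual multiply-by-33 form. The classic version walks bytes; this one walks UTF-16 code units, which is an assumption made here to keep it simple for .NET strings. The class and method names are made up.

```csharp
public static class BernsteinHash
{
    // Bernstein hash: start at 5381, then hash = hash * 33 + next unit.
    // Uses unsigned 32-bit arithmetic that wraps on overflow, so the result
    // is the same regardless of process bitness.
    public static int Compute(string s)
    {
        unchecked
        {
            uint hash = 5381;
            foreach (char c in s)
            {
                hash = hash * 33 + c;
            }
            return (int)hash; // reinterpret as signed to fit an int column
        }
    }
}
```

BernsteinHash.Compute("") returns the seed 5381, and BernsteinHash.Compute("a") returns 177670.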
Edit (the below was written before the question was edited with new salient information):
You can’t do this, at all, theoretically, without some kind of restriction on your input space. Your problem is far more severe than String.GetHashCode differing from platform to platform.
There are a lot of instances of string. In fact, way more instances than there are instances of Int32. So, because of the pigeonhole principle, you will have collisions. You can’t avoid this: your strings are pigeons and your Int32 hash codes are pigeonholes, and there are too many pigeons to go in the pigeonholes without some pigeonhole getting more than one pigeon. Because of collision problems, you can’t use hash codes as unique keys for strings. It doesn’t work. Period.
The only way you can make your current proposed design work (using Int32 as an identifier for instances of string) is if you restrict your input space of strings to something whose size is less than or equal to the number of Int32s. Even then, you’ll have difficulty coming up with an algorithm that maps your input space of strings to Int32 in a unique way.
Even if you try to increase the number of pigeonholes by using SHA-512 or whatever, you still have the possibility of collisions. I doubt you considered that possibility previously in your design; this design path is DOA. And that’s not what SHA-512 is for anyway; it’s not to be used for unique identification of messages. It’s just to reduce the likelihood of message forgery.
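To make the pigeonhole point concrete, here is a small sketch in which two different strings land on the same hash code, so a lookup by hash has to compare the strings anyway. Everything in it is hypothetical: the Hash function is a stand-in (multiply by 31, add the code unit), and the in-memory list of rows stands in for the DB table.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class HashLookupDemo
{
    // Stand-in hash for illustration only.
    public static int Hash(string s)
    {
        unchecked
        {
            int h = 0;
            foreach (char c in s) h = h * 31 + c;
            return h;
        }
    }

    // Simulates "query on the hash column, then post-process for the exact match".
    // Returns null when no row has exactly this key.
    public static string FindExact(IEnumerable<KeyValuePair<string, string>> rows, string key)
    {
        int wanted = Hash(key);
        return rows
            .Where(r => Hash(r.Key) == wanted) // what a query on the hash column would return
            .Where(r => r.Key == key)          // the string comparison you cannot skip
            .Select(r => r.Value)
            .FirstOrDefault();
    }
}
```

Under this stand-in hash, "Aa" and "BB" both hash to 2112, so the first filter returns both rows and only the string comparison picks the right one.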
Well, then you have a tremendous amount of work ahead of you. I’m sorry you discovered this so late in the game.
I note the documentation for String.GetHashCode:
And from Object.GetHashCode:
Hash codes are for balancing hash tables. They are not for identifying objects. You could have caught this sooner if you had used the concept for what it is meant to be used for.