I have a string of arbitrary length (let's say 5 to 2000 characters) for which I would like to calculate a checksum.
Requirements
- The same checksum must be returned every time it is calculated for a given string
- The checksum must be unique (no collisions)
- I cannot store previous IDs to check for collisions
Which algorithm should I use?
Update:
- Is there an approach that is reasonably unique, i.e. one where the likelihood of a collision is very small?
- The checksum should be alphanumeric
- The strings are unicode
- The strings are actually texts that should be translated and the checksum is stored with each translation (so a translated text can be matched back to the original text).
- The length of the checksum is not important for me (the shorter, the better)
Update 2:
Let’s say that I have the following string: "Welcome to this website. Navigate using the flashy but useless menu above".
The string is used in a view, similar to gettext on Linux, i.e. the user just writes (in a Razor view):
@T("Welcome to this website. Navigate using the flashy but useless menu above")
Now I need a way to identify that string so that I can fetch it from a data source (there are several implementations of the data source). Using the entire string as a key seems a bit inefficient, so I’m looking for a way to generate a key from it.
That’s not possible.
If you can’t store previous values, it’s not possible to create a unique checksum that is smaller than the information in the string.
Update:
The term "reasonably unique" doesn’t make sense; either it’s unique or it’s not.
To get a reasonably low risk of hash collisions, you can use a reasonably large hash code.
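To put a rough number on "reasonably low": assuming the hash values behave as if they were uniformly random and nobody is deliberately crafting collisions, the birthday approximation estimates the probability of at least one collision among n distinct strings hashed to b bits. The figures n and b below are illustrative, not taken from the question.

$$
P(\text{collision}) \approx \frac{n(n-1)}{2 \cdot 2^{b}} \approx \frac{n^{2}}{2^{b+1}}
$$

For a 128-bit hash and one million distinct texts (n = 10^6, b = 128) this gives roughly 10^12 / 2^129, or about 1.5 × 10^-27. The approximation only holds while the result is much smaller than 1.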
The MD5 algorithm, for example, produces a 16-byte hash code. Convert the string to a byte array using some encoding that preserves all characters, for example UTF-8, calculate the hash code using the MD5 class, then convert the hash code byte array into a string using the BitConverter class.
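The steps described above could be sketched roughly like this. It is a minimal illustration, not the original answer's code: the class name TextKey and the method name ComputeKey are made up for this example, and stripping the dashes from the BitConverter output is an added assumption to meet the "alphanumeric" requirement from the question.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper; the names are illustrative only.
public static class TextKey
{
    public static string ComputeKey(string text)
    {
        // UTF-8 can represent every Unicode character, so no information is lost.
        byte[] data = Encoding.UTF8.GetBytes(text);

        using (MD5 md5 = MD5.Create())
        {
            // MD5 always produces a 16-byte hash, whatever the input length.
            byte[] hash = md5.ComputeHash(data);

            // BitConverter.ToString returns hex pairs separated by dashes
            // ("D4-1D-8C-..."); removing the dashes leaves 32 hex characters.
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }
}
```

Calling TextKey.ComputeKey with the same text always yields the same 32-character key, so in the scenario from the question the T(...) helper could compute the key from its argument and use it for the lookup; how that is wired into the view is left open here. Note that MD5 is fine for accidental collisions but is not collision-resistant against someone deliberately constructing colliding inputs.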