I want to know the probability of getting duplicate values when calling the GetHashCode() method on string instances. For instance, according to this blog post, blair and brainlessness have the same hashcode (1758039503) on an x86 machine.
I want to know the probability of getting duplicate values when calling the GetHashCode()
Share
Large.
(Sorry Jon!)
The probability of getting a hash collision among short strings is extremely large. Given a set of only ten thousand distinct short strings drawn from common words, the probability of there being at least one collision in the set is approximately 1%. If you have eighty thousand strings, the probability of there being at least one collision is over 50%.
For a graph showing the relationship between set size and probability of collision, see my article on the subject:
https://learn.microsoft.com/en-us/archive/blogs/ericlippert/socks-birthdays-and-hash-collisions