I have seen a prime-number implementation of the GetHashCode function being recommended, for example here. However, using the following code (in VB, sorry), it seems that this implementation gives the same hash density as a “naive” XOR implementation. If the density is the same, I would suppose both implementations have the same probability of collision. Am I missing something about why the prime approach is preferred?
I am assuming that using a byte-sized hash code does not lose generality compared with the integer case.
Sub Main()
    ' Histogram: how many (i, j, k) triples land on each of the 256 hash values.
    Dim XorHashes(255) As Integer
    Dim PrimeHashes(255) As Integer
    For i = 0 To 255
        For j = 0 To 255
            For k = 0 To 255
                XorHashes(GetXorHash(i, j, k)) += 1
                PrimeHashes(GetPrimeHash(i, j, k)) += 1
            Next
        Next
    Next
    For i = 0 To 255
        Console.WriteLine("{0}: {1}, {2}", i, XorHashes(i), PrimeHashes(i))
    Next
    Console.ReadKey()
End Sub

' "Naive" hash: XOR the three values together and keep the low byte.
Public Function GetXorHash(ByVal valueOne As Integer, ByVal valueTwo As Integer, ByVal valueThree As Integer) As Byte
    Return CByte((valueOne Xor valueTwo Xor valueThree) Mod 256)
End Function

' Prime-based hash: multiply by 31 and add each value, then keep the low byte.
Public Function GetPrimeHash(ByVal valueOne As Integer, ByVal valueTwo As Integer, ByVal valueThree As Integer) As Byte
    Dim TempHash = 17
    TempHash = 31 * TempHash + valueOne
    TempHash = 31 * TempHash + valueTwo
    TempHash = 31 * TempHash + valueThree
    Return CByte(TempHash Mod 256)
End Function
The probability of collisions also depends on the expected distribution of the input data. In your example you assume input data that is uniformly distributed over the entire range. This is the ideal situation and it’s no surprise that both algorithms perform well.
However, if you assume that the input values usually share their high bits and differ mainly in the low bits (note: a lot of real data is like this), the prime number method will spread this variation out over the whole hash, whereas the XOR method will not: small changes in the low bits of two or more values can easily cancel each other out when XORed. For example, (1, 2, 3) and (3, 2, 1) give the same XOR hash, and so does any triple of the form (x, x, y) for a fixed y. So the prime number method is less likely to collide in this case.
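To see the difference with your own two functions, here is a minimal sketch that repeats your experiment on clustered input. It assumes (my choice, purely for illustration) that each of the three values only varies in its low four bits, i.e. lies in 0 to 15. The module name ClusteredInputDemo is made up; GetXorHash and GetPrimeHash are copied from your question.

```vbnet
Imports System
Imports System.Collections.Generic

Module ClusteredInputDemo
    Sub Main()
        ' Assumed "clustered" data: only the low four bits (0..15) of each value vary.
        Dim xorSeen As New HashSet(Of Byte)()
        Dim primeSeen As New HashSet(Of Byte)()
        For i As Integer = 0 To 15
            For j As Integer = 0 To 15
                For k As Integer = 0 To 15
                    xorSeen.Add(GetXorHash(i, j, k))
                    primeSeen.Add(GetPrimeHash(i, j, k))
                Next
            Next
        Next
        ' Count how many of the 256 possible hash values each method actually uses.
        Console.WriteLine("Distinct XOR hashes:   {0}", xorSeen.Count)
        Console.WriteLine("Distinct prime hashes: {0}", primeSeen.Count)
    End Sub

    ' Same as in the question: XOR the values and keep the low byte.
    Public Function GetXorHash(ByVal valueOne As Integer, ByVal valueTwo As Integer, ByVal valueThree As Integer) As Byte
        Return CByte((valueOne Xor valueTwo Xor valueThree) Mod 256)
    End Function

    ' Same as in the question: multiply by 31 and add, then keep the low byte.
    Public Function GetPrimeHash(ByVal valueOne As Integer, ByVal valueTwo As Integer, ByVal valueThree As Integer) As Byte
        Dim TempHash As Integer = 17
        TempHash = 31 * TempHash + valueOne
        TempHash = 31 * TempHash + valueTwo
        TempHash = 31 * TempHash + valueThree
        Return CByte(TempHash Mod 256)
    End Function
End Module
```

The XOR of three 4-bit numbers is itself a 4-bit number, so the XOR version can never produce more than 16 distinct hashes for these 4096 inputs, while the multiply-and-add version pushes the same variation into the upper bits of the byte and uses far more of the available values.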
Also, GetHashCode returns a 32-bit Integer, so you should use 32-bit values for it, not 8-bit values.
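For completeness, a 32-bit version might look roughly like the sketch below. The Triple structure and its field names are hypothetical, made up for this example. One thing to watch in VB: integer overflow checks are on by default, so the usual "let it wrap around" trick throws an OverflowException unless you disable the checks in the project settings. This sketch sidesteps that by accumulating in a Long and masking to 32 bits after every step, which is only one of several ways to do it.

```vbnet
Imports System

' Hypothetical value type holding the three values from the question.
Public Structure Triple
    Public ReadOnly ValueOne As Integer
    Public ReadOnly ValueTwo As Integer
    Public ReadOnly ValueThree As Integer

    Public Sub New(ByVal one As Integer, ByVal two As Integer, ByVal three As Integer)
        ValueOne = one
        ValueTwo = two
        ValueThree = three
    End Sub

    Public Overrides Function GetHashCode() As Integer
        ' Accumulate in a Long and keep only the low 32 bits after each step,
        ' so the arithmetic never overflows even with overflow checks enabled.
        Dim hash As Long = 17
        hash = (hash * 31 + ValueOne) And &HFFFFFFFFL
        hash = (hash * 31 + ValueTwo) And &HFFFFFFFFL
        hash = (hash * 31 + ValueThree) And &HFFFFFFFFL
        ' Reinterpret the unsigned 32-bit result as a signed Integer.
        If hash > Integer.MaxValue Then hash -= &H100000000L
        Return CInt(hash)
    End Function
End Structure

Module TripleDemo
    Sub Main()
        ' Same values in a different order; print both hashes for comparison.
        Console.WriteLine(New Triple(1, 2, 3).GetHashCode())
        Console.WriteLine(New Triple(3, 2, 1).GetHashCode())
    End Sub
End Module
```

A real type would also override Equals so that it stays consistent with GetHashCode; that is left out here to keep the sketch short.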