I was studying hash-based sorting and found that using a prime number in a hash function is considered a good idea: multiplying each character of the key by a power of the prime and adding the results up is supposed to produce a unique value (because primes are unique), and a prime such as 31 is supposed to produce a better distribution of keys.
key(s) = s[0]*31^(len-1) + s[1]*31^(len-2) + ... + s[len-1]
Sample code:
    public int hashCode()
    {
        int h = hash;
        if (h == 0)
        {
            for (int i = 0; i < chars.length; i++)
            {
                h = MULT*h + chars[i];
            }
            hash = h;
        }
        return h;
    }
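For anyone who wants to run the loop above, here is a minimal self-contained sketch. The class name PolyHash and the static hash(String, int) helper are hypothetical, not from the original code, and the caching in the hash field is left out for brevity. With a multiplier of 31 the loop computes the same polynomial that Java documents for String.hashCode(), so the two values can be compared directly.

```java
// Hypothetical wrapper for illustration; PolyHash and hash(...) are assumed names.
class PolyHash {
    // Horner's rule: evaluates s[0]*mult^(n-1) + s[1]*mult^(n-2) + ... + s[n-1].
    // Java int arithmetic silently wraps modulo 2^32 on overflow.
    static int hash(String s, int mult) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = mult * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String key = "hello";
        // Compare the hand-rolled loop (MULT = 31) with the built-in String.hashCode().
        System.out.println(hash(key, 31));
        System.out.println(hash(key, 31) == key.hashCode());
    }
}
```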
I would like to understand why using an even number as the multiplier is a bad idea, in the context of the explanation below (found on another forum; it sounds like a good explanation, but I’m failing to grasp it). If the reasoning below is not valid, I would appreciate a simpler explanation.
Suppose MULT were 26, and consider
hashing a hundred-character string.
How much influence does the string’s
first character have on the final
value of ‘h’? The first character’s value
will have been multiplied by MULT 99
times, so if the arithmetic were done
in infinite precision the value would
consist of some jumble of bits
followed by 99 low-order zero bits —
each time you multiply by MULT you
introduce another low-order zero,
right? The computer’s finite
arithmetic just chops away all the
excess high-order bits, so the first
character’s actual contribution to ‘h’
is … precisely zero! The ‘h’ value
depends only on the rightmost 32
string characters (assuming a 32-bit
int), and even then things are not
wonderful: the first of those final 32
bytes influences only the leftmost bit
of ‘h’ and has no effect on the
remaining 31. Clearly, an even-valued
MULT is a poor idea.
I think it’s easier to see if you use 2 instead of 26. They both have the same effect on the lowest-order bit of h. Consider a 33-character string c made up of some nonzero character followed by 32 zero bytes (for illustrative purposes). Since the string isn’t wholly null, you’d hope the hash would be nonzero.
For the first character, your computed hash h is equal to c[0]. For the second character, you take h*2 + c[1], so now h is 2*c[0]. For the third character, h is now h*2 + c[2], which works out to 4*c[0]. Repeat this 30 more times and h is 2^32 * c[0]: the multiplier now needs more bits than a 32-bit h has available, so every bit of c[0] has been shifted out and c[0] has had no impact on the final hash at all.
The end math works out exactly the same with a different even multiplier like 26, except that the intermediate hashes will wrap modulo 2^32 every so often during the process. Since 26 = 2 * 13 is even, it still adds one 0 bit to the low end on each iteration.
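Here is a small sketch you can run to check this. The class EvenMultiplierDemo and its helpers hash and filled are hypothetical names made up for the illustration; the loop itself is the same one as in the question. It hashes the 33-character example (one nonzero character followed by 32 zero bytes) with multipliers 2, 26, and 31, and then hashes two hundred-character strings that differ only in their first character.

```java
import java.util.Arrays;

// Hypothetical demo class; the names here are assumptions, not from the question.
class EvenMultiplierDemo {
    // Same loop as in the question; int overflow wraps modulo 2^32.
    static int hash(char[] chars, int mult) {
        int h = 0;
        for (int i = 0; i < chars.length; i++) {
            h = mult * h + chars[i];
        }
        return h;
    }

    // Builds an array of the given length: one 'first' char, then 'rest' repeated.
    static char[] filled(int len, char first, char rest) {
        char[] a = new char[len];
        Arrays.fill(a, rest);
        a[0] = first;
        return a;
    }

    public static void main(String[] args) {
        // One nonzero character followed by 32 zero bytes.
        char[] c = filled(33, 'c', '\0');
        for (int mult : new int[] {2, 26, 31}) {
            System.out.println("MULT=" + mult + " hash=" + hash(c, mult));
        }

        // Two 100-character strings that differ only in the first character.
        char[] a = filled(100, 'a', 'x');
        char[] b = filled(100, 'b', 'x');
        for (int mult : new int[] {26, 31}) {
            System.out.println("MULT=" + mult + " collide=" + (hash(a, mult) == hash(b, mult)));
        }
    }
}
```

With the even multipliers the 33-character string hashes to 0 and the two hundred-character strings collide, because the first character's term carries a factor of at least 2^32. With 31 the factor is odd, so the first character still reaches the low-order bits and the hashes differ.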