Found this code on http://www.docjar.com/html/api/java/util/HashMap.java.html after searching for a HashMap implementation.
    static int hash(int h) {
        // This function ensures that hashCodes that differ only by
        // constant multiples at each bit position have a bounded
        // number of collisions (approximately 8 at default load factor).
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }
Can someone shed some light on this? The comment tells us why this code is here, but I would like to understand how it improves a bad hash value and how it guarantees a bounded number of collisions. What do these magic numbers mean?
For this to make any sense, it has to be combined with an understanding of how HashMap allocates things into buckets. The bucket index is chosen by a trivial function that masks the hash with the table length minus one, i.e. h & (length - 1).
So with the default table size of 16, only the 4 least significant bits of the hash actually matter for allocating buckets! (16 - 1 = 15, so the hash is masked with 1111b.)
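To make the masking concrete, here is a minimal sketch. The class name BucketIndexDemo is made up for illustration, and I am assuming the helper is called indexFor as in the JDK source of that era; the two sample values are arbitrary.

```java
// Illustrative sketch only; BucketIndexDemo is a hypothetical class name.
class BucketIndexDemo {
    // Masks the hash down to a bucket index. This only works as a
    // replacement for modulo when length is a power of two.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int length = 16;     // default table size
        int a = 0x00000005;
        int b = 0x12345675;  // differs from a only above the low 4 bits
        System.out.println(indexFor(a, length)); // prints 5
        System.out.println(indexFor(b, length)); // prints 5
    }
}
```

Both values end up in bucket 5, even though they differ in almost every other bit.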
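This could clearly be bad news if your hashCode function returned values that differ only above those 4 bits, for example multiples of 16 such as 0x10, 0x20, 0x30 and so on: every one of them would land in bucket 0.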
Such a hash function would not likely be “bad” in any way that is visible to its author. But if you combine it with the way the map allocates buckets, boom, MapFail(tm).
If you keep in mind that h is a 32-bit number, those are not magic numbers at all. The function systematically XORs the most significant bits of the number rightward into the least significant bits. The purpose is to make "differences" that occur anywhere "across" the number, when viewed in binary, visible down in the least significant bits.
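A small sketch of that effect, reusing the hash function from the question. The class name HashSpreadDemo and the sample inputs are my own choices for illustration, and indexFor is assumed to be the masking step described earlier.

```java
// Illustrative sketch only; HashSpreadDemo is a hypothetical class name.
class HashSpreadDemo {
    // Copied from the question: folds higher bits down into the lower ones.
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // Assumed bucket-index step: mask with (length - 1).
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int[] samples = {0x10, 0x20, 0x30}; // low 4 bits are all zero
        for (int h : samples) {
            int raw = indexFor(h, 16);          // 0 for every sample
            int spread = indexFor(hash(h), 16); // 1, 2, 3 respectively
            System.out.println(Integer.toHexString(h) + ": raw=" + raw + " spread=" + spread);
        }
    }
}
```

Without the extra step all three values share bucket 0; after it, the difference that lived in bits 4 and 5 shows up in the low 4 bits and they separate.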
Collisions become bounded because the number of distinct hash values that share the same relevant LSBs is now limited: differences that occur anywhere in the binary representation are folded down into the bits that matter for bucketing.
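If you want to see this for yourself, a rough experiment along these lines counts how full the fullest bucket gets. Everything here is an assumption made for the demo: the class name CollisionCountDemo, the deliberately bad hash codes (multiples of 16), and a table fixed at 16 buckets, whereas a real HashMap would resize as it fills.

```java
// Illustrative experiment only; CollisionCountDemo is a hypothetical class name.
class CollisionCountDemo {
    // Copied from the question.
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // Assumed bucket-index step: mask with (length - 1).
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    // Size of the fullest bucket after placing the hash codes 0, 16, 32, ...
    // (count of them) into a fixed table, with or without the spreading step.
    static int maxBucketLoad(int count, int length, boolean spread) {
        int[] load = new int[length];
        int max = 0;
        for (int k = 0; k < count; k++) {
            int h = k << 4; // bad hashCode: the low 4 bits are always zero
            int index = indexFor(spread ? hash(h) : h, length);
            max = Math.max(max, ++load[index]);
        }
        return max;
    }

    public static void main(String[] args) {
        // Without spreading, all 1024 entries pile into bucket 0.
        System.out.println("raw:    " + maxBucketLoad(1024, 16, false));
        // With spreading, the entries are distributed over the buckets.
        System.out.println("spread: " + maxBucketLoad(1024, 16, true));
    }
}
```

The fixed table keeps the comparison simple; the point is only that the raw hash codes all collide while the spread ones do not.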