I’ve been thinking about HashMaps/Dictionaries recently, and I realise there is a gap in my understanding of their implementation.
In my data structures classes, I was told that a hash value maps a key directly to a location in a vector of linked-list buckets. According to Wikipedia, MurmurHash creates a 32-bit or 128-bit value. Obviously that value cannot map directly to a location in memory. How is that hash value used to assign a location in the underlying vector to the object being placed in the hash map?
After reading David Robinson’s answer I want to expand my question:
If the mapping is based on the size of the underlying vector of lists, what happens when the vector is resized?
Typically, when the result of a hash is generated, it is put through modulo N, where N is the size of the allocated vector of linked lists. In pseudocode: location = hash(key) mod N. This lets the implementer decide on the length of the vector of linked lists (that is, how much space efficiency to trade for time efficiency), while keeping the result pseudorandom.
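As a minimal Python sketch of that modulo step: the helper name `bucket_index` is made up for illustration, and `zlib.crc32` is only a stand-in 32-bit hash (MurmurHash is not in the Python standard library).

```python
import zlib

def bucket_index(key: str, num_buckets: int) -> int:
    """Reduce a 32-bit hash of the key to a position in the bucket vector."""
    h = zlib.crc32(key.encode("utf-8"))  # unsigned 32-bit integer; stand-in for MurmurHash
    return h % num_buckets               # always in the range 0 .. num_buckets - 1

# A vector of 8 buckets, each bucket a list of (key, value) pairs.
buckets = [[] for _ in range(8)]
for key, value in [("apple", 1), ("banana", 2), ("cherry", 3)]:
    buckets[bucket_index(key, len(buckets))].append((key, value))
```

The 32-bit value is never used as an address; it only selects one of the N buckets, and keys that collide in the same bucket sit side by side in that bucket's list.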
A common alternative worth mentioning is bit masking: for example, ignoring all but the b least significant bits. (For instance, performing the operation x & 7 ignores all but the 3 least significant bits.) For non-negative x this is equivalent to x modulo 2^b; it just happens to be faster on most processors.
To answer your second question: if the vector has to be resized, then each value stored in the dictionary does indeed need to be remapped.
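A small Python sketch of both points above: the mask and the modulo agree when the table size is a power of two, and the same hash lands in a different slot once the table grows, which is why entries have to be remapped. The function names and the sample hash value are assumptions chosen for illustration.

```python
def index_mod(h: int, b: int) -> int:
    """Slot index via modulo, for a table of 2**b slots."""
    return h % (1 << b)

def index_mask(h: int, b: int) -> int:
    """Slot index via bit masking: keep only the b least significant bits."""
    return h & ((1 << b) - 1)

h = 0xDEADBEEF  # an arbitrary non-negative 32-bit hash value

# Masking and modulo give the same slot for a power-of-two table size.
same = all(index_mod(h, b) == index_mask(h, b) for b in range(1, 16))

old_slot = index_mask(h, 3)  # table with 8 slots  -> slot 7
new_slot = index_mask(h, 4)  # table with 16 slots -> slot 15
```

Because `old_slot` and `new_slot` differ, an entry stored under the old table size would not be found by a lookup that uses the new size unless it is moved.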
There is an excellent blog post on the implementation of dictionaries in Python that explains how that language implements its built-in hash tables (which it calls dictionaries). In that implementation:
The dictionary is resized (made larger) if more than 2/3 of its slots are being used.
The list of slots is resized to 4 times its current size.
Every value in the old list of slots is remapped to the new list of slots.
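The three steps above can be sketched with a toy map. This is not CPython's actual dictionary code: it uses the chained buckets from the question, and it borrows only the 2/3 threshold and the 4x growth factor. The class and method names are hypothetical.

```python
class ChainedHashMap:
    """Toy hash map: a list of buckets, each bucket a list of (key, value) pairs."""

    def __init__(self, num_slots: int = 8):
        self.slots = [[] for _ in range(num_slots)]
        self.count = 0  # number of stored entries

    def _index(self, key, num_slots: int) -> int:
        return hash(key) % num_slots

    def put(self, key, value) -> None:
        bucket = self.slots[self._index(key, len(self.slots))]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))
        self.count += 1
        # Grow once the entry count exceeds 2/3 of the slot count.
        if self.count * 3 > len(self.slots) * 2:
            self._resize(len(self.slots) * 4)

    def get(self, key):
        for k, v in self.slots[self._index(key, len(self.slots))]:
            if k == key:
                return v
        raise KeyError(key)

    def _resize(self, new_size: int) -> None:
        """Allocate a larger slot list and remap every stored entry into it."""
        old_slots = self.slots
        self.slots = [[] for _ in range(new_size)]
        for bucket in old_slots:
            for key, value in bucket:
                self.slots[self._index(key, new_size)].append((key, value))


m = ChainedHashMap()
for i in range(6):
    m.put(i, i * i)  # the sixth insert pushes the map past 2/3 of 8 slots
```

After the loop the map holds 32 slots, and every key is still retrievable because `_resize` recomputed each index against the new size.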
There are many other useful optimizations described in that blog post; it gives an excellent view of the practical aspects of implementing a hash table.