I was messing around with making a command line parser and was wondering what kind of hash algorithm python dict’s use?
The way I have it set up, I have a pattern match algorithm which matches tokenized input sequences with a dictionary key. Some of the keys are relatively long (length 5 or 6 tuples of 6-7 character strings). I was wondering if there was a point at which long dictionary keys significantly reduce the efficiency of key retrieval.
The hash that it uses depends on the object being used as a key — each class can define its own __hash__() method, and the value that it returns for a particular instance is what is used for the dictionary.
Python itself provides the hash implementation for str and tuple types. A quick look at the source should reveal the exact algorithm for those.
The hash of a tuple is based on the hashes of its contents. The algorithm is essentially this (simplified slightly):
For strings, the interpreter also iterates over every character, combining them with this (again, slightly simplified) algorithm:
A bigger issue to worry about is avoiding hash collisions. Colliding hash keys will cause a linear search as the dictionary tries to find a place to store the new object (this is now being recognized as a security issue, and tha behavior may be changing in upcoming python versions)