I’ve got a huge (>>10m) list of entries. Each entry offers two hash functions:
- Cheap: quickly computes hash, but its distribution is terrible (may put 99% of items in 1% of hash space)
- Expensive: takes a lot of time to compute, but its distribution is much better
An ordinary Dictionary lets me use only one of these hash functions. I’d like a Dictionary that uses the cheap hash function first, and checks the expensive one on collisions.
It seems like a good idea to use a dictionary inside a dictionary for this. Currently I use this monstrosity:
Dictionary<int, Dictionary<int, List<Foo>>>;
I improved this design so the expensive hash gets called only if there are actually two items of the same cheap hash.
It fits perfectly and does a flawless job for me, but it looks like something that should have died 65 million years ago.
To my knowledge, this functionality is not included in the base framework. I am about to write a DoubleHashedDictionary class, but I wanted to hear your opinion first.
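A minimal sketch of what such a class might look like, assuming the two hash functions are passed in as delegates. The name DoubleHashedDictionary comes from the question; everything else (the Bucket type, Add, FindCandidates) is hypothetical. The expensive hash is only computed once a second item lands on the same cheap hash:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch; the member names and the delegate-based design are assumptions.
public sealed class DoubleHashedDictionary<T>
{
    private sealed class Bucket
    {
        public T Single;                              // the only item while the bucket holds one entry
        public Dictionary<int, List<T>> ByExpensive;  // null until a second item arrives
    }

    private readonly Func<T, int> cheapHash;
    private readonly Func<T, int> expensiveHash;
    private readonly Dictionary<int, Bucket> buckets = new Dictionary<int, Bucket>();

    public DoubleHashedDictionary(Func<T, int> cheapHash, Func<T, int> expensiveHash)
    {
        this.cheapHash = cheapHash;
        this.expensiveHash = expensiveHash;
    }

    public void Add(T item)
    {
        int cheap = cheapHash(item);
        if (!buckets.TryGetValue(cheap, out var bucket))
        {
            // First item with this cheap hash: the expensive hash is not needed yet.
            buckets[cheap] = new Bucket { Single = item };
            return;
        }
        if (bucket.ByExpensive == null)
        {
            // Second item with this cheap hash: now hash the item that was stored alone.
            bucket.ByExpensive = new Dictionary<int, List<T>>();
            AddToInner(bucket.ByExpensive, bucket.Single);
            bucket.Single = default(T);
        }
        AddToInner(bucket.ByExpensive, item);
    }

    // Returns the stored items that could match the probe. The caller still has to
    // run a full equality check, since equal hashes do not imply equal items.
    public IReadOnlyList<T> FindCandidates(T probe)
    {
        if (!buckets.TryGetValue(cheapHash(probe), out var bucket))
            return Array.Empty<T>();
        if (bucket.ByExpensive == null)
            return new[] { bucket.Single };  // expensive hash skipped entirely
        return bucket.ByExpensive.TryGetValue(expensiveHash(probe), out var list)
            ? list
            : (IReadOnlyList<T>)Array.Empty<T>();
    }

    private void AddToInner(Dictionary<int, List<T>> inner, T item)
    {
        int expensive = expensiveHash(item);
        if (!inner.TryGetValue(expensive, out var list))
            inner[expensive] = list = new List<T>();
        list.Add(item);
    }
}
```

Note that a lookup in a single-item bucket never touches the expensive hash, so the one candidate it returns has only been matched on the cheap hash.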
As for my specific case:
First hash function = number of files in a file system directory (fast)
Second hash function = sum of size of files (slow)
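For illustration, the two hash functions described above might be written roughly like this. The class and method names are hypothetical, and the sketch assumes only the files directly inside the directory count (no recursion into subdirectories):

```csharp
using System.IO;
using System.Linq;

// Hypothetical helper; names and the non-recursive behaviour are assumptions.
public static class DirectoryHashes
{
    // Cheap: number of files directly inside the directory.
    public static int Cheap(string path)
    {
        return Directory.EnumerateFiles(path).Count();
    }

    // Expensive: total size in bytes of those files, folded from long to int.
    public static int Expensive(string path)
    {
        long totalBytes = new DirectoryInfo(path).EnumerateFiles().Sum(f => f.Length);
        return totalBytes.GetHashCode();
    }
}
```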
Edits:
- Changed title and added more information.
- Added quite important missing detail
First off, I think you’re on the right path in implementing your own hashtable, if what you are describing is truly what you want. But as a critic, I’d like to ask a few questions:
Have you considered using something more unique for each entry?
I am assuming that each entry holds information about a file system directory. Have you considered using its full path as the key, perhaps prefixed with the computer name or IP address?
On the other hand, if you’re using the number of files as the hash key, are those directories never going to change? If the hash of an entry changes after it has been inserted, a lookup will search the wrong bucket and you will never be able to find that entry again.
While on this topic, if the directory content/size is never going to change, can you store that value somewhere to save the time to actually calculate that?
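One way to store the value is to compute the expensive hash at most once per entry and keep it alongside the entry. A minimal sketch using Lazy<T>; the CachedEntry type and its members are hypothetical, and the cached value is only valid as long as the underlying directory does not change:

```csharp
using System;

// Hypothetical wrapper; assumes the expensive hash of a value never changes.
public sealed class CachedEntry<T>
{
    private readonly Lazy<int> expensiveHash;

    public T Value { get; }

    public CachedEntry(T value, Func<T, int> computeExpensiveHash)
    {
        Value = value;
        // The delegate runs on first access to ExpensiveHash, then the result is reused.
        expensiveHash = new Lazy<int>(() => computeExpensiveHash(value));
    }

    public int ExpensiveHash => expensiveHash.Value;
}
```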
Just my few cents.