I have a relatively large set of data that lends itself very naturally to the c#’s dictionary object. Currently, I have 102400 key-value pairs that are being generated semi-dynamically when my program starts up. My problem is that I have to run a great number of lookup operations as fast as possible.
According to This Page the speed of the lookups is directly influenced by the number of key-value pairs in the dictionary. My data is a bit odd in that a great number of different keys lead to the same value. In fact, I have only 4900 distinct values… This means that I have an average of 20 key-value pairs for each distinct value.
My first instinct was to swap the keys for the values (as I only care about the distinct values in the data) then have the old keys in a list or array as the new values. This reduced my dictionary size to 4900 from 102400 key-value pairs, but I can’t see any way to efficiently search all of the lists for a specific value to get the key by.
I know that my description probably got a bit dificult to follow as I switched the keys and values, so I’ve included a mock-up of my data to show you what I mean:
old method:
Key Value
--- -----
1 1
2 2
3 3
4 1
5 3
6 2
7 2
8 1
9 3
10 2
11 3
12 1
New structure:
Key Value
--- -----
1 {1,4,8,12}
2 {2,6,7,10}
3 {3,9,5,11}
In my program, I’m going to be given ’11’ and I’ll need to return ‘3’. The first structure is a simple lookup, but is a huge list which seems to be slowing things down… the second adds so much logical overhead to track down which value list I’m looking for that I’ve only seen a reduction in speed trying to implement it.
Am I barking up the wrong tree here? Should I just accept the speed of the larger list, or is there some other way that I can store my data to increase the lookup speed?
If all keys are distinct and contiguous, then you should consider a simple array; if the keys aren’t contiguous, then a hash map type of structure if they aren’t. This would be approaching O(1) if the hashing function is good, and if they are all integers, shouldn’t take up much space.
Even then, for 102400 elements, a binary tree lookup would take at most log2(102400) operations per lookup which is 16.64 operations, not exactly slow.