I have a known-good Dictionary, and at run time I need to create a new Dictionary and run a check to see if it has the same key-value pairs as the known-good Dictionary (potentially inserted in different orders), and take one path if it does and another if it doesn’t. I don’t necessarily need to serialize the entire known-good Dictionary (I could use a hash, for example), but I need some on-disk data that has enough information about the known-good Dictionary to allow for comparison, if not for recreation. What is the quickest way to do this? I can use a SortedDictionary, but the amount of time required to initialize and add values counts in the speed of this task.
Concrete example:
Consider a Dictionary<String,List<String>> that looks something like this (in no particular order, obviously):
{ {"key1", {"value1", "value2"} }, {"key2", {"value3", "value4"} } }
I create that Dictionary once and save some form of information about it on disk (a full serialization, a hash, whatever). Then, at runtime, I do the following:
Dictionary<String,List<String>> dict1 = new Dictionary<String,List<String>> ();
Dictionary<String,List<String>> dict2 = new Dictionary<String,List<String>> ();
Dictionary<String,List<String>> dict3 = new Dictionary<String,List<String>> ();
String key11 = "key1";
String key12 = "key1";
String key13 = "key1";
String key21 = "key2";
String key22 = "key2";
String key23 = "key2";
List<String> value11 = new List<String> {"value1", "value2"};
List<String> value12 = new List<String> {"value1", "value2"};
List<String> value13 = new List<String> {"value1", "value2"};
List<String> value21 = new List<String> {"value3", "value4"};
List<String> value22 = new List<String> {"value3", "value4"};
List<String> value23 = new List<String> {"value3", "value5"};
dict1.Add(key11, value11);
dict1.Add(key21, value21);
dict2.Add(key22, value22);
dict2.Add(key12, value12);
dict3.Add(key13, value13);
dict3.Add(key23, value23);
dict1.compare(fileName); //Should return true
dict2.compare(fileName); //Should return true
dict3.compare(fileName); //Should return false
Again, if the overall time from startup to the return from compare() is quicker, I can change this code to use a SortedDictionary (or anything else) instead, but I can’t guarantee ordering and I need some consistent comparison. compare() could load a serialization and iterate through the dictionaries, it could serialize the in-memory dictionary and compare the serialization to the file name, or it could do any number of other things.
Solution one: use set equality.
If the dictionaries are of different sizes, you know they are unequal.
If they are of the same size then build a mutable hash set of keys from one dictionary. Remove from it all the keys from the other dictionary. If you attempted to remove a key that wasn’t there, then the key sets are unequal and you know which key was the problem.
Alternatively, build a hash set from each dictionary's keys and take their intersection; the key sets are equal exactly when the intersection is the same size as the original sets.
This takes O(n) time and O(n) space.
Once you know that the key sets are equal, go through the keys one at a time, fetch the values, and compare them. Since the values are sequences, use SequenceEqual. This takes O(n) time and O(1) space.
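As a rough illustration of solution one, here is a minimal sketch. The class and method names (SetEqualityComparison, AreEqual) are hypothetical, and it assumes default string comparers, non-null value lists, and that the order of strings inside each list matters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; the names are illustrative, not from the original answer.
public static class SetEqualityComparison
{
    public static bool AreEqual(
        Dictionary<string, List<string>> expected,
        Dictionary<string, List<string>> actual)
    {
        // Dictionaries of different sizes cannot be equal.
        if (expected.Count != actual.Count) return false;

        // Phase 1: key-set equality. O(n) time, O(n) extra space.
        var remaining = new HashSet<string>(expected.Keys);
        foreach (string key in actual.Keys)
        {
            // Remove returns false if the key was not in the set,
            // so `key` is the offending key at this point.
            if (!remaining.Remove(key)) return false;
        }

        // Phase 2: the key sets match, so compare the values key by key.
        // Assumes no value list is null.
        foreach (KeyValuePair<string, List<string>> pair in expected)
        {
            if (!pair.Value.SequenceEqual(actual[pair.Key])) return false;
        }
        return true;
    }
}
```

With the sample data from the question, comparing the first two dictionaries should report equality regardless of insertion order, while the third should fail in phase 2 because of "value5".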
Solution two: sort the keys
Again, if the dictionaries are of different size, you know they are unequal.
If they are of the same size, sort both sets of keys and do a SequenceEqual on them; if the sequences of keys are unequal then the dictionaries are unequal.
This takes O(n lg n) time and O(n) space.
If that succeeds, then again, go through the keys one at a time and compare the values.
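A sketch of solution two might look like the following. The names SortedKeyComparison and AreEqual are hypothetical, and the choice of ordinal ordering is an assumption made here so that the sort does not depend on the current culture.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; the names are illustrative, not from the original answer.
public static class SortedKeyComparison
{
    public static bool AreEqual(
        Dictionary<string, List<string>> expected,
        Dictionary<string, List<string>> actual)
    {
        // Dictionaries of different sizes cannot be equal.
        if (expected.Count != actual.Count) return false;

        // Sort both key sets: O(n lg n) time, O(n) extra space.
        // Ordinal ordering is an assumption; any consistent ordering works.
        List<string> expectedKeys =
            expected.Keys.OrderBy(k => k, StringComparer.Ordinal).ToList();
        List<string> actualKeys =
            actual.Keys.OrderBy(k => k, StringComparer.Ordinal).ToList();

        // Equal key sets produce identical sorted sequences.
        if (!expectedKeys.SequenceEqual(actualKeys)) return false;

        // The keys match, so compare the values one key at a time.
        // Assumes no value list is null.
        foreach (string key in expectedKeys)
        {
            if (!expected[key].SequenceEqual(actual[key])) return false;
        }
        return true;
    }
}
```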
Solution three: look up each key directly
Again, check the dictionaries to see if they are the same size.
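If they are, then iterate over the keys of one dictionary and check to see if the key exists in the other dictionary. If it does not, then they are not equal. If it does, then check the corresponding values for equality. Because the dictionaries are the same size and keys are unique, finding every key of one in the other guarantees that the key sets are identical.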
This is O(n) in time and O(1) in space.
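Solution three can be sketched in a few lines. As before, the names DirectLookupComparison and AreEqual are hypothetical, and the sketch assumes non-null value lists whose internal order matters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; the names are illustrative, not from the original answer.
public static class DirectLookupComparison
{
    public static bool AreEqual(
        Dictionary<string, List<string>> expected,
        Dictionary<string, List<string>> actual)
    {
        // Dictionaries of different sizes cannot be equal.
        if (expected.Count != actual.Count) return false;

        foreach (KeyValuePair<string, List<string>> pair in expected)
        {
            // A missing key means the dictionaries differ.
            if (!actual.TryGetValue(pair.Key, out List<string> other)) return false;

            // The key exists in both, so check the value immediately.
            // Assumes no value list is null.
            if (!pair.Value.SequenceEqual(other)) return false;
        }

        // Same size and every key matched with an equal value.
        return true;
    }
}
```

This version allocates no extra collections and checks keys and values in a single pass, so it stops at the first mismatch of either kind.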
How to choose amongst these possible solutions? It depends on what the likely failure mode is, and whether you need to know what the missing or extra key is. If the likely failure mode is a bad key then it might be more performant to choose a solution that concentrates on finding the bad key first, and only checking for bad values if all the keys turn out to be OK. If the likely failure mode is a bad value, then the third solution is probably best, since it prioritizes checking values early.