I am looking for a way to reconcile elements from 3 different sources. I've simplified the elements to having just a key (string) and a version (long).
The lists are obtained concurrently (2 from separate database queries, and 1 from a memory cache on another system).
For my end result, I only care about elements whose versions are not identical across all 3 sources. So the result I care about would be a list of keys, with the corresponding version in each system:
Element1 | system1:v100 | system2:v100 | system3:v101 |
Element2 | system1:missing | system2:v200 | system3:v200 |
Elements with identical versions can be discarded.
The 2 ways of achieving this that I thought of are:
- Wait for all data sources to finish retrieving, and then loop through each list to aggregate a master list with a union of keys + all 3 versions (discarding all identical items).
- As soon as the first list is done being retrieved, put it into a concurrent collection such as ConcurrentDictionary (offered in .NET 4.0), and start aggregating the remaining lists (into the concurrent collection) as soon as they are available.
My thinking is that the second approach will be a little quicker, but probably not by much. I can't really do much until all 3 sources are present, so not much is gained from the 2nd approach, and contention is introduced.
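The first approach can be sketched roughly as follows. This is a minimal illustration only, assuming each source has already been loaded into a dictionary of key to version. `Reconciler`, `FindMismatches`, and the `Missing` sentinel of -1 are hypothetical names and choices made for the example.

```csharp
using System;
using System.Collections.Generic;

public static class Reconciler
{
    // Assumed sentinel for "key not present in this source";
    // pick a value that can never be a real version.
    public const long Missing = -1;

    // Returns key -> [v1, v2, v3] for every key whose versions
    // are not identical in all three sources.
    public static Dictionary<string, long[]> FindMismatches(
        IDictionary<string, long> s1,
        IDictionary<string, long> s2,
        IDictionary<string, long> s3)
    {
        var sources = new[] { s1, s2, s3 };
        var merged = new Dictionary<string, long[]>();

        // Build the union of keys, recording each source's version in its own slot.
        for (int i = 0; i < sources.Length; i++)
        {
            foreach (var pair in sources[i])
            {
                long[] versions;
                if (!merged.TryGetValue(pair.Key, out versions))
                {
                    versions = new[] { Missing, Missing, Missing };
                    merged.Add(pair.Key, versions);
                }
                versions[i] = pair.Value;
            }
        }

        // Keep only the keys where at least one slot differs.
        var result = new Dictionary<string, long[]>();
        foreach (var pair in merged)
        {
            long[] v = pair.Value;
            if (v[0] != v[1] || v[1] != v[2])
            {
                result.Add(pair.Key, v);
            }
        }
        return result;
    }
}
```

With the two example rows above, `Element1` would come back as `[100, 100, 101]` and `Element2` as `[-1, 200, 200]`, while any key with the same version in all three sources is dropped.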
Maybe there is a completely different way to go about this? Also, since versions are stored as longs and there will be hundreds of thousands (possibly millions) of elements, memory allocation could be a concern (though probably not a big one, since these objects are short-lived).
HashSet is an option, as it has UnionWith and IntersectWith methods.
HashSet.UnionWith Method
To use this you must override Equals and GetHashCode.
A good (unique) hash is key to performance.
If every version is a "v" followed by a number, you could use the numeric part to build the hash, with missing as 0.
You have an Int32 to play with, so if each version fits in 10 bits or less you can create a perfect hash.
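As a rough sketch of what overriding Equals and GetHashCode for use in a HashSet could look like, assuming an element is just a non-null key plus a version: the `Element` and `ElementSets` types and the `NotInAll` helper are hypothetical, and the hash combination shown is one common pattern rather than a tuned or perfect hash.

```csharp
using System;
using System.Collections.Generic;

public sealed class Element : IEquatable<Element>
{
    public string Key { get; private set; }
    public long Version { get; private set; }

    public Element(string key, long version)
    {
        Key = key;          // assumed non-null in this sketch
        Version = version;
    }

    // Two elements are equal only if both key and version match.
    public bool Equals(Element other)
    {
        return other != null && Key == other.Key && Version == other.Version;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as Element);
    }

    public override int GetHashCode()
    {
        unchecked
        {
            // Combine both fields; overflow simply wraps around.
            return (Key.GetHashCode() * 397) ^ Version.GetHashCode();
        }
    }
}

public static class ElementSets
{
    // Elements present in at least one source but not identical in all three.
    public static HashSet<Element> NotInAll(
        IEnumerable<Element> a, IEnumerable<Element> b, IEnumerable<Element> c)
    {
        var union = new HashSet<Element>(a);
        union.UnionWith(b);
        union.UnionWith(c);

        var common = new HashSet<Element>(a);
        common.IntersectWith(b);
        common.IntersectWith(c);

        union.ExceptWith(common);
        return union;
    }
}
```

The result still has to be grouped by key afterwards to get one row per key with its three versions.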
Another option is ConcurrentDictionary (there is no concurrent HashSet), with all three sources feeding into it.
You would still need to override Equals and GetHashCode.
My gut feeling is that three HashSets followed by a union would be faster.
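For comparison, here is a minimal sketch of three sources feeding one ConcurrentDictionary. It makes a simplifying assumption that differs from the element-as-key idea above: the dictionary is keyed by the key string, so string's built-in equality is used and no custom Equals/GetHashCode is needed in this variant. `ConcurrentReconciler`, `Merge`, the loader delegates, and the `Missing` sentinel are all hypothetical.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ConcurrentReconciler
{
    public const long Missing = -1; // assumed sentinel for "not in this source"

    private static long[] NewSlots()
    {
        return new[] { Missing, Missing, Missing };
    }

    // Runs the three loaders in parallel; each writes into its own slot.
    public static ConcurrentDictionary<string, long[]> Merge(
        Func<IEnumerable<KeyValuePair<string, long>>> load1,
        Func<IEnumerable<KeyValuePair<string, long>>> load2,
        Func<IEnumerable<KeyValuePair<string, long>>> load3)
    {
        var loaders = new[] { load1, load2, load3 };
        var merged = new ConcurrentDictionary<string, long[]>();
        var tasks = new Task[loaders.Length];

        for (int i = 0; i < loaders.Length; i++)
        {
            int slot = i;            // per-iteration copy for the closure
            var load = loaders[i];
            tasks[i] = Task.Factory.StartNew(() =>
            {
                foreach (var pair in load())
                {
                    // GetOrAdd returns the array actually stored for this key.
                    long[] versions = merged.GetOrAdd(pair.Key, _ => NewSlots());
                    versions[slot] = pair.Value; // only this task writes this slot
                }
            });
        }

        Task.WaitAll(tasks); // results are only read after all sources finish
        return merged;
    }
}
```

Entries whose three slots are equal can then be filtered out in a single pass once `Merge` returns.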
If all versions are numeric and you can use 0 for missing, then you could just pack them into a UInt32 or UInt64 and put that directly in a HashSet. After the union, unpack. Use bit shifting (<<) rather than math to pack and unpack.
This is just two UInt16 but it runs in 2 seconds.
This is going to be faster than Hashing classes.
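The packing idea might look something like the sketch below. It is not the test referred to above (that code is not shown here); it assumes two 16-bit values, a hypothetical numeric id and a version with 0 meaning missing, and the `Packing` class and its method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class Packing
{
    // High 16 bits hold the id, low 16 bits hold the version (0 = missing).
    public static uint Pack(ushort id, ushort version)
    {
        return ((uint)id << 16) | version;
    }

    public static ushort UnpackId(uint packed)
    {
        return (ushort)(packed >> 16);
    }

    public static ushort UnpackVersion(uint packed)
    {
        return (ushort)(packed & 0xFFFF);
    }

    // Union of three sources of packed values; duplicates collapse because
    // uint already has cheap, collision-free equality and hashing.
    public static HashSet<uint> UnionAll(
        IEnumerable<uint> a, IEnumerable<uint> b, IEnumerable<uint> c)
    {
        var set = new HashSet<uint>(a);
        set.UnionWith(b);
        set.UnionWith(c);
        return set;
    }
}
```

Because the set holds plain integers, no Equals or GetHashCode override is involved, which is where the speed advantage over hashing classes would be expected to come from.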
If all three versions are long, then HashSet<integral type> will not be an option. long1 ^ long2 ^ long3 might be a good hash, but that is not my expertise.
I know GetHashCode on a Tuple is bad.
Tested using a ConcurrentDictionary and the above was over twice as fast.
Taking locks on the inserts is expensive.