I'm looking for the fastest way to find all strings in a collection that start with a given sequence of characters. I can use a sorted collection for this, but I can't find a convenient way to do it in .NET. Basically, I need to find the low and high indexes in the collection that bound the matching strings.
BinarySearch on List<T> does not guarantee that the returned index is that of the first matching element, so one would need to iterate up and down from the hit to find all matching strings, which is slow on a large list.
There are also LINQ methods (including parallel ones), but I'm not sure which data structure will give the best results.
Example list, ~10M records:
aaaaaaaaaaaaaaabb
aaaaaaaaaaaaaaba
aaaaaaaaaaaaabc
...
zzzzzzzzzzzzzxx
zzzzzzzzzzzzzyzzz
zzzzzzzzzzzzzzzzzza
Search for strings starting with: skk…
Result: record indexes from x to y.
UPDATE: strings can have different lengths and are unique.
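For reference, the low and high indexes described above can be found on a sorted list with two binary searches, without scanning outward from an arbitrary hit. The following is a minimal sketch in Python, used here only to keep it short; the names first_index_where and prefix_range are hypothetical, and it assumes the list is sorted in plain code-point (ordinal) order, so a culture-sensitive sort would need a matching comparison.

```python
def first_index_where(items, predicate):
    """Return the smallest index i for which predicate(items[i]) is true.

    Assumes the predicate is false for a (possibly empty) run of leading
    items and true for all items after that. Returns len(items) if it is
    never true.
    """
    lo, hi = 0, len(items)
    while lo < hi:
        mid = (lo + hi) // 2
        if predicate(items[mid]):
            hi = mid          # items[mid] qualifies, so the answer is at mid or earlier
        else:
            lo = mid + 1      # items[mid] does not qualify, so the answer is after mid
    return lo


def prefix_range(sorted_strings, prefix):
    """Return (low, high) so that sorted_strings[low:high] are exactly
    the strings that start with prefix (half-open range)."""
    k = len(prefix)
    # Truncating every string to k characters keeps the list non-decreasing,
    # so both predicates below flip from false to true at most once.
    low = first_index_where(sorted_strings, lambda s: s[:k] >= prefix)
    high = first_index_where(sorted_strings, lambda s: s[:k] > prefix)
    return low, high
```

Each search does O(log n) comparisons of at most |S| characters, where S is the prefix and n is the number of strings. An empty range (low equal to high) means no string has the prefix.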
In terms of time complexity, you should use a trie rather than a sorted set or binary search.
A trie gives you O(|S|) time complexity, where S is the prefix [a sorted set or binary search gives you O(|S| log n)], to find the node [call it v] that represents that prefix.
All the strings [paths] in the trie that match the prefix pass through v. By adding a numberOfLeaves field to each node, you can find out exactly how many leaves [= strings] lie below that node.
In a single pass you can also find the index of v: for each node u on the path from the root to v, sum numberOfLeaves over every sibling to the left of u.
This requires much more work than using existing structures, but if the data is huge it can make your algorithm much faster, so you should consider it if performance is an issue and you expect a huge set of strings.
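The counting idea above could be sketched roughly as follows. This is an illustration in Python under stated assumptions, not a tuned implementation: the class and method names are hypothetical, the field number_of_strings plays the role of numberOfLeaves, and because one stored string may be a prefix of another, it counts strings ending at or below a node rather than strict leaves. Strings are assumed unique and ordered by code point.

```python
class TrieNode:
    def __init__(self):
        self.children = {}          # character -> TrieNode
        self.is_word = False        # True if a stored string ends at this node
        self.number_of_strings = 0  # stored strings ending at or below this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        # Assumes each word is inserted only once; duplicates would be over-counted.
        node = self.root
        node.number_of_strings += 1
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
            node.number_of_strings += 1
        node.is_word = True

    def prefix_range(self, prefix):
        """Return (low, high): the half-open range of positions that the strings
        starting with prefix would occupy in the sorted list of all strings."""
        node = self.root
        low = 0
        for ch in prefix:
            # A string ending at the current node sorts before everything below it.
            if node.is_word:
                low += 1
            # Add the counts of all siblings to the left of the next node.
            # This loop costs O(alphabet size) per level.
            low += sum(child.number_of_strings
                       for key, child in node.children.items() if key < ch)
            nxt = node.children.get(ch)
            if nxt is None:
                return low, low     # no stored string has this prefix
            node = nxt
        return low, low + node.number_of_strings
```

The walk visits one node per character of the prefix, so its cost does not depend on the number of stored strings, only on the prefix length and the branching at each level.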