I have to implement a set ADT for pairs of strings. The interface I want is (in Java):
public interface StringSet {
    void add(String a, String b);
    boolean contains(String a, String b);
    void remove(String a, String b);
}
The data access pattern has the following properties:
- The `contains` operation is far more frequent than the `add` and `remove` ones.
- More often than not, `contains` returns `true`, i.e. the search is successful.
A simple implementation I can think of is to use a two-level hashtable, i.e. `HashMap<String, HashMap<String, Boolean>>`. But this data structure makes no use of the two peculiarities of the access pattern. I am wondering if there is something more efficient than the hashtable, maybe by leveraging the access pattern peculiarities.
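For reference, the two-level hashtable idea could look roughly like the sketch below. The class name `NestedHashStringSet` is made up for illustration, and `implements StringSet` is left out only so the snippet compiles on its own.

```java
import java.util.HashMap;
import java.util.Map;

// Baseline sketch: a two-level hash table keyed by the first string, then the second.
// It offers the same three methods as the StringSet interface from the question.
class NestedHashStringSet {
    private final Map<String, Map<String, Boolean>> outer = new HashMap<>();

    public void add(String a, String b) {
        // Create the inner map for a on first use, then record b in it.
        outer.computeIfAbsent(a, key -> new HashMap<>()).put(b, Boolean.TRUE);
    }

    public boolean contains(String a, String b) {
        Map<String, Boolean> inner = outer.get(a);
        return inner != null && inner.containsKey(b);
    }

    public void remove(String a, String b) {
        Map<String, Boolean> inner = outer.get(a);
        if (inner == null) {
            return;
        }
        inner.remove(b);
        if (inner.isEmpty()) {
            outer.remove(a); // drop empty inner maps so they do not pile up
        }
    }
}
```

Each `contains` call here hashes both strings in full, whether or not the pair is present.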
Do not use normal trees (most standard library data structures) for this. There is one simple assumption, which will hurt you in this case:
The normal `O(log(n))` cost of operations on trees assumes that comparisons are `O(1)`. This is true for integers and most other keys, but not for strings. For strings, each comparison is `O(k)`, where `k` is the length of the string. This makes all operations dependent on the length, which will most likely hurt you if you need to be fast, and it is easily overlooked.
Especially if you most often return true, there will be `k` comparisons for each string at each level, so with this access pattern you will experience the full drawback of strings in trees.
Your access pattern is easily handled by a trie. Testing whether a string is contained is `O(k)` in the worst case (not just in the average case, as in a hash map). Adding a string is also `O(k)`. Since you are storing two strings, I would suggest that you don't index your trie by characters but by some larger type, so you can add two special index values: one for the end of the first string, and one for the end of both strings.
In your case, using these two extra symbols would also allow for simple removal: just delete the final node containing the end symbol and your string will not be found anymore. You will waste some memory, because the deleted strings are still in your structure. If this is a problem, you could keep track of the number of deleted strings and rebuild your trie when this gets too bad.
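A minimal sketch of that idea, under a few assumptions of mine: symbols are `int` values, `-1` and `-2` serve as the two markers (they cannot clash with any `char`), and each node keeps its children in a `HashMap`. The names `PairTrie`, `END_FIRST`, `END_BOTH`, and `removedPairs` are made up for illustration, and the rebuild step is left out.

```java
import java.util.HashMap;
import java.util.Map;

class PairTrie {
    // Symbols are ints so that two values outside the char range can act as markers.
    private static final int END_FIRST = -1; // end of the first string
    private static final int END_BOTH = -2;  // end of both strings

    private static final class Node {
        final Map<Integer, Node> children = new HashMap<>();
    }

    private final Node root = new Node();
    private int removedCount = 0; // removals that left unused nodes behind

    public void add(String a, String b) {
        walk(a, b, true).children.computeIfAbsent(END_BOTH, s -> new Node());
    }

    public boolean contains(String a, String b) {
        Node end = walk(a, b, false);
        return end != null && end.children.containsKey(END_BOTH);
    }

    public void remove(String a, String b) {
        // Only the final marker node is deleted; the path to it stays in place.
        Node end = walk(a, b, false);
        if (end != null && end.children.remove(END_BOTH) != null) {
            removedCount++;
        }
    }

    // Lets the caller decide when a rebuild is worthwhile.
    public int removedPairs() {
        return removedCount;
    }

    // Follows a, then END_FIRST, then b. Returns the node reached, or null
    // if the path is missing and create is false.
    private Node walk(String a, String b, boolean create) {
        Node node = root;
        for (int i = 0; i < a.length() && node != null; i++) {
            node = step(node, a.charAt(i), create);
        }
        if (node != null) {
            node = step(node, END_FIRST, create);
        }
        for (int i = 0; i < b.length() && node != null; i++) {
            node = step(node, b.charAt(i), create);
        }
        return node;
    }

    private static Node step(Node node, int symbol, boolean create) {
        return create
                ? node.children.computeIfAbsent(symbol, s -> new Node())
                : node.children.get(symbol);
    }
}
```

The `END_FIRST` marker is what keeps `("ab", "c")` and `("a", "bc")` apart, even though both pairs concatenate to the same characters.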
P.S. A trie can be thought of as a combination of a tree and several hashtables, so this gives you the best of both data structures.