I was asked the question in an interview. The interviewer told me to assume that there exists a function say getNextWord() to return the next word in a given document. My task was to design a data structure to implement the task, and give an algorithm that constructs a list of all words with their frequencies.
Being from a C++ background, my answer was to create a multimap of string and then insert all words in it, and later display the count of it. I was however told later, to do this in a more generic way. By generic he meant that he didn’t want me to use a library feature. Also I guess a multimap is implemented internally as a 2-3 tree or so, so for the multimap solution to be generic I would need to code the 2-3 tree as well.
Although tries did come to mind, implementing one during an interview was out of question for me. So, I just wanted to know if there are better ways of achieving it? Or is there a way to implement it in a smooth manner using tries?
Any histogram based algorithm would be both effient and generic in here. The idea is simple: build a histogram from the data. A generic interface for a histogram is a
Map<String,Integer>Iterate the document once (with your nextDoc() method), while maintaining your histogram.
The best implementation for this interface, in terms of big O notations – would probably be to use a trie, and in each leaf node – add the counter of occurances.
Getting the actual
(word,number)pairs from the trie will be done by a simple DFS on the trie.This solution gives you
O(n * |S|)time complexity, where |S| is the average size for a string.The insertion algorithm for each word:
Each time you add a new word: check if it already exists, if it does – increase the counter, else – add the word to the dictionary with a counter value of 1.