Just for fun, I would like to estimate the conditional probability that a word (from a natural language) appears in a text, given the two words before it. That is, I would take a huge bunch of, e.g., English texts and count how often each combination n(i|jk) and n(jk) appears (where j, k, i are successive words).
The naive approach would be a 3-D array (for n(i|jk)), with a mapping from each word to its position along the 3 dimensions. The position look-up could be done efficiently using tries (at least that’s my best guess), but already for O(1000) words I would run into memory constraints, since 1000^3 is 10^9 counters. I also guess that this array would be only sparsely filled, with most entries being zero, so I would waste lots of memory. So no 3-D array.
What data structure would be better suited to such a use case and still be efficient for the many small updates I do when counting the appearances of the words? (Maybe there is a completely different way of doing this?)
(Of course I also need to count n(jk), but that’s easy, because it’s only 2-D 🙂)
The language of choice is C++ I guess.
C++ code:
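A minimal sketch of a sparse layout for the counts, assuming words have already been replaced by integer indexes. The names `BigramKey`, `BigramData`, `TrigramTable` and `conditional` are hypothetical, chosen only for illustration. Each bigram that actually occurs gets one entry holding its own count plus a small map of the words that followed it, so unseen combinations cost no memory.

```cpp
#include <map>
#include <utility>

// Hypothetical names; word indexes are assumed to come from a dictionary.
using BigramKey = std::pair<int, int>;  // (i, j): two successive words

struct BigramData {
    long count = 0;             // how often the bigram (i, j) was seen
    std::map<int, long> next;   // k -> how often k followed (i, j)
};

// Only bigrams that actually occur get an entry.
using TrigramTable = std::map<BigramKey, BigramData>;

// Relative-frequency estimate of P(k | i, j).
// Returns 0.0 if the bigram or the trigram was never seen.
double conditional(const TrigramTable& table, int i, int j, int k) {
    auto bigram = table.find(BigramKey{i, j});
    if (bigram == table.end() || bigram->second.count == 0) return 0.0;
    auto trigram = bigram->second.next.find(k);
    if (trigram == bigram->second.next.end()) return 0.0;
    return static_cast<double>(trigram->second) / bigram->second.count;
}
```

Dividing the trigram count by the bigram count is just the plain relative frequency; any smoothing for unseen combinations would have to be added on top.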
The dictionary could be a vector of all found words, but for a better word->index lookup it could be a map from each word to its index.
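One way to sketch such a dictionary is to keep both containers side by side: the vector answers index->word and the map answers word->index. The class name `Dictionary` and its member functions are hypothetical, not part of any library.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: assigns each distinct word a dense integer index.
class Dictionary {
public:
    // Returns the index of `word`, adding the word first if it is new.
    int index_of(const std::string& word) {
        auto found = index_.find(word);
        if (found != index_.end()) return found->second;
        int id = static_cast<int>(words_.size());
        words_.push_back(word);
        index_.emplace(word, id);
        return id;
    }

    // Reverse lookup; throws std::out_of_range for an unknown index.
    const std::string& word(int id) const { return words_.at(id); }

    std::size_t size() const { return words_.size(); }

private:
    std::vector<std::string> words_;    // index -> word
    std::map<std::string, int> index_;  // word -> index
};
```

If the ordering of the words does not matter, `std::unordered_map` could be swapped in for `std::map` to get average constant-time lookups.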
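When you read a new word, you add it to the dictionary and get its index k. You already have the indexes i and j of the previous two words, so you just increment the count of the bigram (i, j) and the count of k as its successor. For better performance you may search for the bigram only once and update both counts through that one entry.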
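The update step can be sketched as follows, assuming the input is already split into words. The function name `count_trigrams` and the type names are hypothetical; the dictionary is reduced to a plain `std::map` to keep the example short. The reference `data` is what makes the bigram lookup happen only once per word.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

using BigramKey = std::pair<int, int>;  // (i, j): previous two words

struct BigramData {
    long count = 0;             // how often (i, j) was followed by a word
    std::map<int, long> next;   // k -> how often k followed (i, j)
};

using TrigramTable = std::map<BigramKey, BigramData>;

// Counts every trigram (i, j, k) in a sequence of words.
TrigramTable count_trigrams(const std::vector<std::string>& words) {
    std::map<std::string, int> dictionary;  // word -> index
    TrigramTable table;
    int i = -1, j = -1;                     // -1 means "no word yet"
    for (const std::string& word : words) {
        // emplace leaves an existing entry untouched and returns it,
        // so a known word keeps its old index.
        int k = dictionary
                    .emplace(word, static_cast<int>(dictionary.size()))
                    .first->second;
        if (i >= 0) {
            // Single lookup of the bigram; both counters use this entry.
            BigramData& data = table[BigramKey{i, j}];
            ++data.count;
            ++data.next[k];
        }
        i = j;  // slide the two-word window forward
        j = k;
    }
    return table;
}
```

Note that in this sketch a bigram is only counted when a third word follows it, so the last two words of the text do not add to any count.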
Is it understandable? Do you need more details?