Remark: I know there are many similar questions on SO, but none are specific to the C language, which is why I am asking this one.
Here’s the problem I am facing: I will be provided a large text (e.g., 150,000 words) and after that a series of phrases (each phrase has from 1 up to 10 words). For each of those phrases I need to find the word that immediately follows the phrase in the text and return it.
My only idea to solve it so far: create a struct that holds:
- the current word
- the 3 words that preceded that word
- the word that follows
Then I would parse the text, creating one struct for each word, and store all those structs in a hash table keyed by the current word. As each phrase comes along, I would search the hash table for the last word of that phrase, check whether the previous 3 words match, and then return the next word. I believe going back 3 words would be enough to uniquely identify phrases, but I could increase that number.
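The record described above could be sketched in C roughly as follows. This is only an illustration under assumptions: the names `word_record`, `fill_record`, `record_matches` and `CONTEXT_WORDS` are hypothetical, the text is assumed to be already tokenized into an array of word strings, and the hash table itself is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CONTEXT_WORDS 3  /* how many preceding words each record remembers */

/* One record per word occurrence. The pointers refer to strings owned
   elsewhere (assumed: a tokenized copy of the text). */
struct word_record {
    const char *word;                 /* the current word */
    const char *prev[CONTEXT_WORDS];  /* preceding words, nearest first; NULL if absent */
    const char *next;                 /* the word that follows; NULL at end of text */
};

/* Fill *r for the word at index i of words[0..n-1]. */
static void fill_record(struct word_record *r, const char *const *words,
                        size_t n, size_t i)
{
    assert(i < n);
    r->word = words[i];
    for (size_t k = 0; k < CONTEXT_WORDS; k++)
        r->prev[k] = (i > k) ? words[i - 1 - k] : NULL;
    r->next = (i + 1 < n) ? words[i + 1] : NULL;
}

/* Return 1 if the phrase ends at this record, comparing the last word and
   at most CONTEXT_WORDS words before it; return 0 otherwise. */
static int record_matches(const struct word_record *r,
                          const char *const *phrase, size_t len)
{
    assert(len >= 1);
    if (strcmp(r->word, phrase[len - 1]) != 0)
        return 0;
    for (size_t k = 0; k < CONTEXT_WORDS && k + 1 < len; k++) {
        if (r->prev[k] == NULL || strcmp(r->prev[k], phrase[len - 2 - k]) != 0)
            return 0;
    }
    return 1;
}
```

Note that with this layout a phrase longer than `CONTEXT_WORDS + 1` words is only checked against its last four words, so two different long phrases that share a four-word ending would be treated as the same.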
Do you think this would work? Do you know a better way?
A much easier approach: run through the text, storing all n-grams (subsequences of n consecutive words) for 1 <= n <= 10 in a hash table or trie. Retrieval is then trivial: just look up the phrase as an n-gram in the hash table or trie.
In the hash table version, you’d just store each n-gram as the concatenation of its word strings, with whitespace normalized to a single space between words.
The problem with this approach is that with a hash table, you’ll need up to 10 * N entries (one per n-gram length at each starting position), where N is the number of words in the text, and the keys together hold up to 55 * N words. Lookup should be very fast, though, and 150,000 words is a small enough dataset to make this work.
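For illustration, the hash table version might look roughly like the sketch below. Several things here are assumptions rather than part of the answer: the text is already tokenized into an array of words, each n-gram key maps to the word that follows it, only the first occurrence of a repeated n-gram is kept, and keys are capped at `MAX_KEY` bytes. All names (`ngram_table`, `ngram_build`, `ngram_lookup`, `ngram_free`) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_N   10    /* longest phrase length, from the question */
#define MAX_KEY 1024  /* assumed upper bound on key length in bytes */

struct entry {
    char *key;            /* n-gram: words joined by single spaces */
    const char *next;     /* word that follows this n-gram in the text */
    struct entry *chain;  /* next entry in the same bucket */
};

struct ngram_table {
    struct entry **buckets;
    size_t nbuckets;
};

/* 32-bit FNV-1a string hash. */
static uint32_t hash_str(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) { h ^= (unsigned char)*s++; h *= 16777619u; }
    return h;
}

/* Return the stored follower of key, or NULL if key was never indexed. */
static const char *ngram_lookup(const struct ngram_table *t, const char *key)
{
    struct entry *e = t->buckets[hash_str(key) % t->nbuckets];
    for (; e != NULL; e = e->chain)
        if (strcmp(e->key, key) == 0)
            return e->next;
    return NULL;
}

/* Insert key -> next unless key is already present. Returns -1 on OOM. */
static int ngram_insert(struct ngram_table *t, const char *key, const char *next)
{
    if (ngram_lookup(t, key) != NULL)
        return 0;  /* assumption: keep the first occurrence only */
    struct entry *e = malloc(sizeof *e);
    if (e == NULL) return -1;
    size_t len = strlen(key) + 1;
    e->key = malloc(len);
    if (e->key == NULL) { free(e); return -1; }
    memcpy(e->key, key, len);
    e->next = next;
    size_t b = hash_str(key) % t->nbuckets;
    e->chain = t->buckets[b];
    t->buckets[b] = e;
    return 0;
}

/* Index every n-gram (1 <= n <= MAX_N) of words[0..count-1] that has a successor. */
static int ngram_build(struct ngram_table *t, const char *const *words, size_t count)
{
    t->nbuckets = count * MAX_N + 1;  /* at most MAX_N keys start at each word */
    t->buckets = calloc(t->nbuckets, sizeof *t->buckets);
    if (t->buckets == NULL) return -1;
    char key[MAX_KEY];
    for (size_t i = 0; i + 1 < count; i++) {
        size_t used = 0;
        for (size_t n = 1; n <= MAX_N && i + n < count; n++) {
            const char *w = words[i + n - 1];
            size_t wl = strlen(w);
            if (used + wl + 2 > MAX_KEY) break;  /* room for space and NUL */
            if (n > 1) key[used++] = ' ';
            memcpy(key + used, w, wl);
            used += wl;
            key[used] = '\0';
            if (ngram_insert(t, key, words[i + n]) != 0) return -1;
        }
    }
    return 0;
}

/* Release all entries and the bucket array. */
static void ngram_free(struct ngram_table *t)
{
    for (size_t b = 0; b < t->nbuckets; b++) {
        struct entry *e = t->buckets[b];
        while (e != NULL) {
            struct entry *nx = e->chain;
            free(e->key);
            free(e);
            e = nx;
        }
    }
    free(t->buckets);
}
```

A query phrase would need the same normalization as the keys (words joined by single spaces) before being passed to `ngram_lookup`. Building the key incrementally, one word per inner iteration, avoids re-concatenating the whole n-gram for every length.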