Say I have 100 keywords (that can include spaces) and I need to find out how many times they occur in a big piece of text. What would be the fastest way to accomplish this?
My current idea is as follows:
- turn the keywords into a suffix tree
- walk through the text following the nodes and whenever a char does not occur (i.e. node->next == NULL) in the suffix tree, skip to next word and search again
The suffix tree struct would look something like this:
struct node {
    int count; /* number of occurrences (only used at leaf node) */
    /* for each lower-case char, have a pointer to either NULL or next node */
    struct node *children[26];
};
I am sure there is a faster way to do this, but what is it? Space efficiency is not really a big deal for this case (hence the children array for faster lookup), but time efficiency really is. Any suggestions?
The problem with the suffix tree approach is that you have to start the suffix search for each letter of the text to be searched. I think the best way to go would be to arrange a search for each keyword in the text, but using some fast search method with precomputed values, such as Boyer-Moore.
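To make the per-keyword idea concrete, here is a minimal sketch that counts occurrences with the Horspool simplification of Boyer-Moore (bad-character shifts only, not the full algorithm). The names `count_horspool` and `count_keywords` are made up for illustration, and it assumes byte-oriented text and non-empty keywords.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Count (possibly overlapping) occurrences of pat[0..m-1] in text[0..n-1]
 * using Horspool's variant of Boyer-Moore. */
size_t count_horspool(const char *text, size_t n, const char *pat, size_t m)
{
    size_t shift[UCHAR_MAX + 1];
    size_t count = 0;

    assert(m > 0); /* assumption: keywords are never empty */
    if (m > n)
        return 0;

    /* Precomputed values: how far the window may slide when the text
     * character under the last pattern position is c. */
    for (size_t c = 0; c <= UCHAR_MAX; c++)
        shift[c] = m;
    for (size_t i = 0; i + 1 < m; i++)
        shift[(unsigned char)pat[i]] = m - 1 - i;

    for (size_t pos = 0; pos + m <= n; ) {
        if (memcmp(text + pos, pat, m) == 0)
            count++;
        pos += shift[(unsigned char)text[pos + m - 1]];
    }
    return count;
}

/* Hypothetical driver: one pass over the text per keyword. */
void count_keywords(const char *text, const char *const *keywords,
                    size_t nkeywords, size_t *counts)
{
    size_t n = strlen(text);
    for (size_t k = 0; k < nkeywords; k++)
        counts[k] = count_horspool(text, n, keywords[k], strlen(keywords[k]));
}
```

Long keywords let the window jump up to m characters at a time, which is where the sub-linear average case comes from. The cost is that the text is scanned once per keyword.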
EDIT:
OK, you may well be right that the trie is faster. Boyer-Moore is very fast in the average case. Suppose, for example, that the text has length n and the keywords have a mean length of m. BM can be as fast as O(n/m) for “normal” strings, which makes 100*O(n/m) for 100 keywords. The trie would be O(n*m) on average (though it is true it can be much faster in real life), so the trie wins when n*m < 100*n/m, that is, when 100 >> m*m (keywords well under about 10 characters).
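For comparison, a rough sketch of the trie scan whose cost is estimated above. It restarts the walk at every text position, which is where the O(n*m) comes from. The names (`trie_node`, `trie_insert`, `trie_count`) are illustrative, and it uses 256 children per node rather than 26 so that spaces and other bytes work, on the assumption that memory is not a concern.

```c
#include <assert.h>
#include <stdlib.h>

struct trie_node {
    int keyword_id;                  /* index of the keyword ending here, or -1 */
    struct trie_node *children[256]; /* one slot per byte value */
};

struct trie_node *trie_new_node(void)
{
    struct trie_node *node = malloc(sizeof *node);
    assert(node != NULL); /* sketch only: real code should handle failure */
    node->keyword_id = -1;
    for (size_t i = 0; i < 256; i++)
        node->children[i] = NULL;
    return node;
}

void trie_insert(struct trie_node *root, const char *keyword, int id)
{
    struct trie_node *node = root;
    for (const unsigned char *p = (const unsigned char *)keyword; *p; p++) {
        if (node->children[*p] == NULL)
            node->children[*p] = trie_new_node();
        node = node->children[*p];
    }
    node->keyword_id = id; /* a keyword may end at an inner node, not only a leaf */
}

/* Walk the trie from every text position; each walk takes at most
 * (longest keyword length) steps. counts[] is indexed by keyword id. */
void trie_count(const struct trie_node *root, const char *text, int *counts)
{
    for (const unsigned char *start = (const unsigned char *)text; *start; start++) {
        const struct trie_node *node = root;
        for (const unsigned char *p = start; *p; p++) {
            node = node->children[*p];
            if (node == NULL)
                break; /* no keyword continues with this character */
            if (node->keyword_id >= 0)
                counts[node->keyword_id]++;
        }
    }
}
```

Because the count is recorded wherever a keyword ends, a keyword that is a prefix of another (say "the" and "the cat") is still counted correctly. All 100 keywords are handled in a single pass over the text.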
Now for random ideas on optimization. In some compression algorithms that have to do backward searches, I’ve seen partial hash tables indexed by two characters of the string. That is, if the string to check is the sequence of characters c1, c2, and c3, you can check whether the table entry for the pair (c1, c2) is set, then do the same for c2 and c3, and so on. It is surprising how many cases you avoid by doing this simple check: with 100 keywords at most 100 of the 65536 entries are set, so if character pairs were evenly distributed the check would pass only about 100/65536 of the time (roughly 0.15%).
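A minimal sketch of that two-character check, assuming a plain 65536-entry boolean table indexed by the first two bytes of each keyword. The function names are hypothetical, every keyword is assumed to be at least two bytes long, and the full comparison after the filter is a deliberately naive loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAIR_COUNT 65536 /* 256 * 256 possible two-byte prefixes */

static size_t pair_index(unsigned char c1, unsigned char c2)
{
    return (size_t)c1 * 256 + c2;
}

/* Mark the first two bytes of every keyword in table[0..PAIR_COUNT-1]. */
void build_pair_table(bool *table, const char *const *keywords, size_t nkeywords)
{
    memset(table, 0, PAIR_COUNT * sizeof *table);
    for (size_t k = 0; k < nkeywords; k++) {
        assert(strlen(keywords[k]) >= 2); /* assumption of this sketch */
        table[pair_index((unsigned char)keywords[k][0],
                         (unsigned char)keywords[k][1])] = true;
    }
}

/* Count keyword occurrences, doing the expensive comparison only at
 * positions whose two-byte pair is marked. Returns how many positions
 * passed the filter. */
size_t count_with_filter(const bool *table, const char *text,
                         const char *const *keywords, size_t nkeywords,
                         size_t *counts)
{
    size_t n = strlen(text);
    size_t candidates = 0;

    for (size_t pos = 0; pos + 1 < n; pos++) {
        if (!table[pair_index((unsigned char)text[pos],
                              (unsigned char)text[pos + 1])])
            continue; /* no keyword starts with this pair: skip cheaply */
        candidates++;
        for (size_t k = 0; k < nkeywords; k++) {
            size_t m = strlen(keywords[k]);
            if (m <= n - pos && memcmp(text + pos, keywords[k], m) == 0)
                counts[k]++;
        }
    }
    return candidates;
}
```

The returned candidate count makes it easy to see how many positions the filter actually lets through on real text. The inner loop could be swapped for a trie walk or a per-pair keyword list once the filter has done its job.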