Say I have 100 keywords (that can include spaces) and I need to find

Asked: May 24, 2026

Say I have 100 keywords (that can include spaces) and I need to find out how many times they occur in a big piece of text. What would be the fastest way to accomplish this?

My current idea is as follows:

1. Turn the keywords into a suffix tree.
2. Walk through the text following the nodes, and whenever a char does not occur in the suffix tree (i.e. the child pointer for that char is NULL), skip to the next word and search again.

The suffix tree struct would look something like this:

struct node {
   int count; //number of occurrences (only used at leaf node)
   /* for each lower-case char, have a pointer to either NULL or next node */
   struct node *children[26];
};

I am sure there is a faster way to do this, but what is it? Space efficiency is not really a big deal for this case (hence the children array for faster lookup), but time efficiency really is. Any suggestions?

1 Answer

Editorial Team · Answer 1 · 2026-05-24T17:17:26+00:00

The problem with the suffix tree approach is that you have to restart the tree walk at every character of the text being searched. I think the best approach would be to run a separate search for each keyword, using a fast search method with precomputed tables, such as Boyer-Moore.
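As a rough illustration of the "precomputed values" idea, here is a minimal sketch that counts one keyword with the Horspool simplification of Boyer-Moore (bad-character shift table only, not the full algorithm). The function name `count_horspool` and its signature are assumptions made for this example; you would call it once per keyword.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count (possibly overlapping) occurrences of pat in
 * text using the Horspool variant of Boyer-Moore. */
size_t count_horspool(const char *text, size_t n, const char *pat, size_t m)
{
    size_t shift[UCHAR_MAX + 1];
    size_t count = 0;

    assert(m > 0);
    if (m > n)
        return 0;

    /* Precompute how far the window may slide when its last byte is c. */
    for (size_t c = 0; c <= UCHAR_MAX; c++)
        shift[c] = m;
    for (size_t i = 0; i + 1 < m; i++)
        shift[(unsigned char)pat[i]] = m - 1 - i;

    for (size_t pos = 0; pos + m <= n; ) {
        if (memcmp(text + pos, pat, m) == 0)
            count++;
        /* Slide by the shift of the text byte under the window's last slot. */
        pos += shift[(unsigned char)text[pos + m - 1]];
    }
    return count;
}
```

Bytes that do not occur in the keyword let the window jump a full m positions, which is where the roughly n/m behaviour on ordinary text comes from.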

EDIT:

OK, you may be right that the trie is faster. Boyer-Moore is very fast in the average case. Suppose, for example, that the keywords have a mean length of m. BM can be as fast as O(n/m) per keyword for “normal” strings, which makes 100*O(n/m) for all keywords. The trie would be O(n*m) on average (though it can be much faster in real life), so the trie wins when 100/m > m, that is, when m is below about 10. More generally, it wins when the number of keywords is much larger than m^2.
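For comparison, the trie scan discussed above could look roughly like the following sketch. It restarts the walk at every text position, which is where the O(n*m) bound comes from. The names `trie_node`, `trie_insert` and `trie_count` are made up for this example, and the node uses one child slot per byte value (an assumption, so that spaces in keywords work) rather than the 26-entry array from the question.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

struct trie_node {
    size_t count;                               /* matches ending at this node */
    int is_keyword;                             /* nonzero if a keyword ends here */
    struct trie_node *children[UCHAR_MAX + 1];  /* one slot per byte value */
};

/* Add one keyword; returns its terminal node, or NULL if allocation fails.
 * calloc is assumed to yield NULL child pointers (all-bits-zero), which
 * holds on mainstream platforms. */
struct trie_node *trie_insert(struct trie_node *root, const char *keyword)
{
    struct trie_node *node = root;

    assert(root != NULL && keyword != NULL);
    for (; *keyword != '\0'; keyword++) {
        unsigned char c = (unsigned char)*keyword;
        if (node->children[c] == NULL) {
            node->children[c] = calloc(1, sizeof *node);
            if (node->children[c] == NULL)
                return NULL;
        }
        node = node->children[c];
    }
    node->is_keyword = 1;
    return node;
}

/* Restart the walk at every text position and bump the counter of each
 * keyword that ends along the way. */
void trie_count(struct trie_node *root, const char *text, size_t n)
{
    for (size_t start = 0; start < n; start++) {
        struct trie_node *node = root;
        for (size_t i = start; i < n; i++) {
            node = node->children[(unsigned char)text[i]];
            if (node == NULL)
                break;          /* no keyword continues with this byte */
            if (node->is_keyword)
                node->count++;
        }
    }
}
```

Keeping the counter on every keyword-ending node, not only on leaves, lets a keyword that is a prefix of another one (such as "new" and "new york") be counted as well.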

Now for some random ideas on optimization. In some compression algorithms that have to do backward searches, I’ve seen partial hash tables indexed by two characters of the string. That is, if the string to check is the sequence of characters c1, c2, and c3, you can check whether:

if (hash_table[c1 * 256 + c2] == true) check_strings_beginning_with(c1, c2);

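then do the same for c2 and c3, and so on. It is surprising how many cases you avoid with this simple check: with 100 keywords, at most 100 of the 65536 table entries are set, so on uniformly random input the check passes only about 100/65536 of the time (roughly 0.15%). Real text is not uniform, so the actual rate will be somewhat higher.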
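A minimal sketch of that two-character filter, assuming every keyword is at least two bytes long. The names `pair_seen`, `prefilter_build` and `prefilter_count` are invented for this example. For brevity, a position that passes the filter is compared against all keywords instead of only those starting with the matching pair.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* true if some keyword starts with the byte pair (c1, c2) */
static bool pair_seen[256 * 256];

/* Mark the first two bytes of every keyword (each must have length >= 2). */
void prefilter_build(const char *const *keywords, size_t k)
{
    memset(pair_seen, 0, sizeof pair_seen);
    for (size_t i = 0; i < k; i++) {
        const unsigned char *s = (const unsigned char *)keywords[i];
        assert(s[0] != '\0' && s[1] != '\0');
        pair_seen[s[0] * 256 + s[1]] = true;
    }
}

/* Total number of keyword occurrences; full comparisons run only at
 * positions where the two-byte filter passes. */
size_t prefilter_count(const char *text, size_t n,
                       const char *const *keywords, size_t k)
{
    size_t total = 0;

    for (size_t pos = 0; pos + 1 < n; pos++) {
        unsigned char c1 = (unsigned char)text[pos];
        unsigned char c2 = (unsigned char)text[pos + 1];
        if (!pair_seen[c1 * 256 + c2])
            continue;                   /* common case: nothing starts here */
        for (size_t i = 0; i < k; i++) {
            size_t m = strlen(keywords[i]);
            if (m <= n - pos && memcmp(text + pos, keywords[i], m) == 0)
                total++;
        }
    }
    return total;
}
```

A fuller version would keep, for each set table entry, the list of keywords beginning with that pair, so that only those are compared.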
