As part of a document reader I’m writing for iPhone/iPad, I need the following functionality:
I need to search a document of approximately 500 to 10,000 words for words and phrases that appear in one of several lists. Each list contains between 100 and 5,000 words and phrases. When I find a word or phrase in the document that appears in one of those lists, I mark it and move on.
I will know the word lists ahead of time, but the documents will be unknown until the moment they need to be processed.
And this needs to be VERY FAST.
Any help would be greatly appreciated!
This presentation and paper describe a fast multi-pattern string search algorithm. They also mention some predecessors, in case this one does not fit your needs.
Multifast is an open source (LGPL-licensed) C library that implements the Aho-Corasick algorithm.
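To give a feel for how Aho-Corasick works, here is a minimal sketch in Python. It is not Multifast's API; the function names `build_automaton` and `find_matches` are made up for illustration. The idea is that the automaton is built once from the word lists (which are known ahead of time), and each document is then scanned in a single pass regardless of how many patterns there are.

```python
from collections import deque

def build_automaton(patterns):
    """Build a trie with failure links from the known word/phrase list."""
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # fail[node] -> node for the longest proper suffix in the trie
    out = [[]]    # out[node] -> patterns that end at this node
    for pat in patterns:
        if not pat:
            continue  # ignore empty patterns
        node = 0
        for ch in pat:
            nxt = goto[node].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[node][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            node = nxt
        out[node].append(pat)
    # Breadth-first pass: depth-1 nodes keep fail = 0, deeper nodes inherit.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # A match at the suffix node is also a match here.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out

def find_matches(text, automaton):
    """Scan text once; return (start_index, pattern) for every occurrence."""
    goto, fail, out = automaton
    node = 0
    matches = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            matches.append((i - len(pat) + 1, pat))
    return matches

automaton = build_automaton(["he", "she", "his", "hers"])
print(find_matches("ushers", automaton))
# [(1, 'she'), (2, 'he'), (2, 'hers')]
```

Note that this sketch matches raw substrings, so "he" is reported inside "ushers". For marking whole words and phrases in a document you would presumably also check that each match starts and ends on a word boundary, and normalize case before building and searching.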