I am planning to implement the following. Let's say I have a dictionary of the following form:
Bob Dylan,
AC / DC,
The Amboy Dukes,
George Thorogood & The Destroyers.
So the dictionary contains entries of one token, two tokens, and possibly up to n tokens.
Now, when I have content (a paragraph), I would like to link any phrase in it that is part of the above dictionary. For example, if my content is of the form:
Bob Dylan was born Robert Allen Zimmerman in St. Mary’s Hospital on
May 24, 1941, in Duluth, Minnesota, and raised in Hibbing, Minnesota,
on the Mesabi Iron Range west of Lake Superior.
In the paragraph, we see that Bob Dylan is used, and Bob Dylan is part of the dictionary. Is there an algorithm to identify this efficiently when the dictionary has millions of records?
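You might be looking for the Aho-Corasick string matching algorithm.
It builds an automaton (a trie with failure links) from your dictionary and then scans the text once, reporting every dictionary entry that occurs in it. The scan time is linear in the length of the text plus the number of matches, regardless of how many entries the dictionary has.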
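To make the idea concrete, here is a minimal Python sketch of Aho-Corasick. The function names build_automaton and find_matches are my own, chosen for illustration, and the matching is done per character rather than per token, so you would probably want to add a word-boundary check before linking. For millions of entries a tuned library implementation is likely a better fit than this plain-Python version.

```python
from collections import deque

def build_automaton(phrases):
    """Build the trie, failure links and output lists for the given phrases."""
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # fail[node] -> node for the longest proper suffix in the trie
    out = [[]]    # out[node] -> phrases that end at this node

    # Step 1: insert every phrase into the trie.
    for phrase in phrases:
        node = 0
        for ch in phrase:
            nxt = goto[node].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[node][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            node = nxt
        out[node].append(phrase)

    # Step 2: breadth-first pass to compute failure links.
    queue = deque(goto[0].values())  # depth-1 nodes fail to the root
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out

def find_matches(text, automaton):
    """Return (start_index, phrase) for every dictionary phrase found in text."""
    goto, fail, out = automaton
    node = 0
    matches = []
    for i, ch in enumerate(text):
        # Follow failure links until a transition on ch exists (or we hit the root).
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for phrase in out[node]:
            matches.append((i - len(phrase) + 1, phrase))
    return matches

dictionary = [
    "Bob Dylan",
    "AC / DC",
    "The Amboy Dukes",
    "George Thorogood & The Destroyers",
]
automaton = build_automaton(dictionary)
text = "Bob Dylan was born Robert Allen Zimmerman in St. Mary's Hospital"
print(find_matches(text, automaton))  # [(0, 'Bob Dylan')]
```

The automaton is built once and can be reused for every paragraph, which is what makes this approach attractive when the dictionary is large and the content keeps arriving.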