There’s a lot of software that will take a search string and find all of the text in your database that contains it (MySQL’s WHERE MATCH(string_column) AGAINST('searchterm'), Google, etc.), but is there a good algorithm for going the other way?
Say I have a list of search terms:
Toyota Prius, Toyota Tacoma, Honda Civic, Chevy Nova, Chevy Volt
And I have a string, like:
1962 Chevy Nova convertable
Is there a good algorithm where I can put the list and the string in, and get Chevy Nova out?
If they’re all easily tokenized, I could tokenize them and do an inner join, but I’m interested in the case where I can’t tell which part of the input string is the “important” part.
if you’re tokenizing “1962 Chevy Nova convertable” [sic], you’ll end up with four tokens that are all important or interesting enough to care about. if you’re keeping track of all of the possible words in your language, you’ll have an index for each of those words.
and on the other hand, you’ve got your search terms. in each of those cases, you’ve tokenized and indexed the interesting words. each search term can be thought of as a pair of token indexes.
then if you take your input and look for search terms that match, you’ll be asking: which of the search terms contain any of the words in the input?
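before bringing a database into it, the idea can be sketched in plain Python as a small inverted index. this is only an illustration under simple assumptions: the tokenizer just lowercases and splits on whitespace, and the names `tokenize`, `build_index` and `match_counts` are made up for this example.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_index(search_terms):
    """Map each token to the set of search terms that contain it."""
    index = defaultdict(set)
    for term in search_terms:
        for token in tokenize(term):
            index[token].add(term)
    return index


def match_counts(index, text):
    """Count, per search term, how many distinct input tokens it shares."""
    counts = Counter()
    for token in set(tokenize(text)):
        # .get() avoids creating empty entries for unknown words
        for term in index.get(token, ()):
            counts[term] += 1
    return counts


SEARCH_TERMS = ["Toyota Prius", "Toyota Tacoma", "Honda Civic",
                "Chevy Nova", "Chevy Volt"]

index = build_index(SEARCH_TERMS)
counts = match_counts(index, "1962 Chevy Nova convertable")
best = counts.most_common(1)[0]
print(best)  # ('Chevy Nova', 2)
```

words like “1962” and “convertable” simply match nothing, so there is no need to decide up front which part of the input is the important part.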
since I’m a database guy at heart, I can imagine creating the token list like so:
and a table of searches so that each can have an id:
and then a table combining the searches and tokens:
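one way these three tables might look, sketched with SQLite through Python's `sqlite3` module rather than MySQL. the table and column names (`tokens`, `searches`, `search_tokens`) are assumptions for illustration, and the ids are assigned in insertion order, so they will not necessarily line up with the token ids quoted in the text. this sketch also stores only the words that appear in search terms, not every word in the language.

```python
import sqlite3

SEARCH_TERMS = ["Toyota Prius", "Toyota Tacoma", "Honda Civic",
                "Chevy Nova", "Chevy Volt"]


def build_db(search_terms):
    """Create the three tables in memory and load the search terms."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- one row per distinct word
        CREATE TABLE tokens (
            id    INTEGER PRIMARY KEY,
            token TEXT NOT NULL UNIQUE
        );
        -- one row per search term, so each has an id
        CREATE TABLE searches (
            id   INTEGER PRIMARY KEY,
            term TEXT NOT NULL UNIQUE
        );
        -- which tokens make up which search term
        CREATE TABLE search_tokens (
            search_id INTEGER NOT NULL REFERENCES searches(id),
            token_id  INTEGER NOT NULL REFERENCES tokens(id),
            PRIMARY KEY (search_id, token_id)
        );
    """)
    for term in search_terms:
        search_id = conn.execute(
            "INSERT INTO searches (term) VALUES (?)", (term,)
        ).lastrowid
        for word in term.lower().split():
            # reuse the token row if the word has been seen before
            conn.execute(
                "INSERT OR IGNORE INTO tokens (token) VALUES (?)", (word,)
            )
            token_id = conn.execute(
                "SELECT id FROM tokens WHERE token = ?", (word,)
            ).fetchone()[0]
            conn.execute(
                "INSERT OR IGNORE INTO search_tokens VALUES (?, ?)",
                (search_id, token_id),
            )
    conn.commit()
    return conn


conn = build_db(SEARCH_TERMS)
token_count = conn.execute("SELECT COUNT(*) FROM tokens").fetchone()[0]
search_count = conn.execute("SELECT COUNT(*) FROM searches").fetchone()[0]
link_count = conn.execute("SELECT COUNT(*) FROM search_tokens").fetchone()[0]
print(token_count, search_count, link_count)  # 8 5 10
```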
now if we take the input string “1962 Chevy Nova convertable” and turn it into tokens (1, 2, 5, 10), we can make a query that looks at the tokens of the search terms:
the result of which is:
or querying a little bit differently:
resulting in:
we can see that “Chevy Nova” matches two tokens and is the best match, which, of course, it is.
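the two kinds of query described above can be sketched as follows, again assuming SQLite via Python's `sqlite3` and hypothetical table names (`tokens`, `searches`, `search_tokens`). instead of passing token ids, this version looks tokens up by their text, which is a simplification of the approach in the text; `matching_rows` and `rank_matches` are names invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tokens (id INTEGER PRIMARY KEY, token TEXT UNIQUE);
    CREATE TABLE searches (id INTEGER PRIMARY KEY, term TEXT UNIQUE);
    CREATE TABLE search_tokens (
        search_id INTEGER REFERENCES searches(id),
        token_id  INTEGER REFERENCES tokens(id),
        PRIMARY KEY (search_id, token_id)
    );
""")
for term in ["Toyota Prius", "Toyota Tacoma", "Honda Civic",
             "Chevy Nova", "Chevy Volt"]:
    sid = conn.execute(
        "INSERT INTO searches (term) VALUES (?)", (term,)).lastrowid
    for word in term.lower().split():
        conn.execute(
            "INSERT OR IGNORE INTO tokens (token) VALUES (?)", (word,))
        tid = conn.execute(
            "SELECT id FROM tokens WHERE token = ?", (word,)).fetchone()[0]
        conn.execute("INSERT INTO search_tokens VALUES (?, ?)", (sid, tid))


def _input_words(text):
    """Distinct lowercase words of the input string."""
    return sorted(set(text.lower().split()))


def matching_rows(conn, text):
    """One row per (search term, matching token) pair."""
    words = _input_words(text)
    if not words:
        return []
    marks = ",".join("?" * len(words))  # only '?' placeholders are interpolated
    return conn.execute(f"""
        SELECT s.term, t.token
        FROM searches AS s
        JOIN search_tokens AS st ON st.search_id = s.id
        JOIN tokens AS t ON t.id = st.token_id
        WHERE t.token IN ({marks})
        ORDER BY s.term, t.token
    """, words).fetchall()


def rank_matches(conn, text):
    """Search terms ordered by how many input tokens they share."""
    words = _input_words(text)
    if not words:
        return []
    marks = ",".join("?" * len(words))
    return conn.execute(f"""
        SELECT s.term, COUNT(*) AS matched
        FROM searches AS s
        JOIN search_tokens AS st ON st.search_id = s.id
        JOIN tokens AS t ON t.id = st.token_id
        WHERE t.token IN ({marks})
        GROUP BY s.id, s.term
        ORDER BY matched DESC, s.term
    """, words).fetchall()


rows = matching_rows(conn, "1962 Chevy Nova convertable")
ranked = rank_matches(conn, "1962 Chevy Nova convertable")
print(ranked)  # [('Chevy Nova', 2), ('Chevy Volt', 1)]
```

a match count alone favours longer search terms that happen to share common words, so in practice it may be worth dividing by the number of tokens in the search term or requiring that all of them match; that is a design choice beyond what the text describes.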