I need some help with this issue:
As an input I have a string, which looks like Blue cat green eyes 2342342, or it can be Cat blue eyes green 23242 or any other permutation of words.
In my DB table I have some data. One of the columns is called, say, keyWords.
Here is an example of this table:

My task is to find the record in my DB table whose KEYWORDS column matches some of the words from the input string.
For example, for the strings “Blue cat green eyes 2342342”, “Cat blue eyes green 23242” and “Cat 23242 eyes blue green” the result must be “blue cat” (the first row of my table).
The only way I can imagine how to solve this task looks like this:
- Take each word from the string in turn.
- Search for this word with `%like%` in the table column.
- If it is not found, the word is not a key word and we have no interest in it.
- If it is found exactly once, great! No doubt, this is what we are looking for.
- If there is more than one result:
  - From the words of the string that have not been tested yet, take each word in turn.
  - Search for this word with `%like%` in the results from step 2.
  - etc…
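For reference, the steps above could be sketched roughly as follows. This is only an illustration in Python with an in-memory SQLite database: the table name `keywords`, its `id` column, the sample rows and the function name `narrow_with_like` are all assumptions, since the real table is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table contents; the real table is not shown in the question.
conn.executescript("""
    CREATE TABLE keywords (id INTEGER PRIMARY KEY, keyWords TEXT);
    INSERT INTO keywords VALUES (1, 'blue cat'), (2, 'black cat'), (3, 'red dog');
""")

def narrow_with_like(conn, sentence):
    """Return the row IDs still matching after narrowing word by word with LIKE."""
    candidates = None  # None means no word has matched yet
    for word in sentence.lower().split():
        rows = conn.execute(
            "SELECT id FROM keywords WHERE lower(keyWords) LIKE ?",
            (f"%{word}%",),
        ).fetchall()
        ids = {row[0] for row in rows}
        if candidates is not None:
            ids &= candidates  # only look inside the earlier results
        if not ids:
            continue           # not a key word: ignore it
        candidates = ids
        if len(candidates) == 1:
            break              # unique match found
    return candidates or set()

print(narrow_with_like(conn, "Cat 23242 eyes blue green"))
```

Each word costs one `LIKE '%…%'` query, which cannot use an ordinary index, so this is exactly the part that gets slow on a large table.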
Graphical schema of this algorithm is here
But it looks like this algorithm will be very slow if there are a lot of records in the table and if my input string consists of a large number of words.
So, my question is: are there any special algorithms that can help solve this task?
You can adopt another table such as
and transform the string
in a series of indexes and counts:
This would perform a series of exact matches and return, say,
Then you know that keyword group with id 1 has two words, which means that a count of 2 matches all of them. So keywordid 1 is satisfied. Group 2 has also two words (black, cat) but only one was found, and the match is there but not complete.
If you also record the keyword set size together with keyword ID, then all keywords from the same ID will have the same KeywordSize, and you can GROUP BY it too:
and can even `SELECT COUNT(*)/KeywordSize AS match ... ORDER BY match` and have keyword matches sorted by relevancy.
Of course, once you have KeywordID, you can find it in the keywords table.
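To make the counting concrete, here is a minimal sketch in plain Python rather than SQL. The word rows (group 1 = blue, cat; group 2 = black, cat) follow the example above, but the exact layout and the name `rank_groups` are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical word table: (keyword_id, word, keyword_size).
WORD_ROWS = [
    (1, "blue", 2), (1, "cat", 2),
    (2, "black", 2), (2, "cat", 2),
]

def rank_groups(sentence, word_rows=WORD_ROWS):
    """Return (keyword_id, hits, size, ratio) tuples, best ratio first."""
    words = set(sentence.lower().split())
    hits = Counter()
    sizes = {}
    for keyword_id, word, size in word_rows:
        sizes[keyword_id] = size
        if word in words:  # exact match, no LIKE needed
            hits[keyword_id] += 1
    ranked = [(kid, n, sizes[kid], n / sizes[kid]) for kid, n in hits.items()]
    ranked.sort(key=lambda r: r[3], reverse=True)
    return ranked

print(rank_groups("Blue cat green eyes 2342342"))
# [(1, 2, 2, 1.0), (2, 1, 2, 0.5)]
```

One thing to watch when doing this in SQL: in some engines (SQLite and PostgreSQL, for example) `COUNT(*)/KeywordSize` is an integer division, so you may need something like `COUNT(*) * 1.0 / KeywordSize` to get a fractional ratio.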
Implementation
You want to add the keyword list “black angry cat” to your existing table.
So you explode this keyword list into words and get “black”, “angry” and “cat”.
You insert the keyword list normally in the table that you already have, and retrieve the ID for that newly created row, let’s say it is 1701.
Now you insert the words into a new table that we call “ancillary”. This table only contains the keyword row ID of your primary table, the single word, and the size of the word list from which that word comes.
We know we are inserting 3 words in all, for table row 1701, so size=3 and we insert these tuples:
(These will receive a unique ID of their own, but this does not concern us).
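The insertion step might look something like this. It is a sketch using Python's `sqlite3` module; the table name `keywords`, the column names `id`, `KeywordID`, `Word` and `KeywordSize`, and the helper `add_keyword_list` are assumptions, so adapt them to your schema. In this fresh in-memory database the new row gets ID 1 rather than 1701.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: "keywords" stands in for the table you already have.
conn.executescript("""
    CREATE TABLE keywords (id INTEGER PRIMARY KEY, keyWords TEXT);
    CREATE TABLE ancillary (
        id INTEGER PRIMARY KEY,
        KeywordID INTEGER,
        Word TEXT,
        KeywordSize INTEGER
    );
    CREATE INDEX ancillary_word ON ancillary (Word);
""")

def add_keyword_list(conn, keyword_list):
    """Insert a keyword list plus its exploded words; return the new row ID."""
    words = keyword_list.lower().split()
    cur = conn.execute(
        "INSERT INTO keywords (keyWords) VALUES (?)", (keyword_list,)
    )
    keyword_id = cur.lastrowid
    # One ancillary row per word, each carrying the size of the whole list.
    conn.executemany(
        "INSERT INTO ancillary (KeywordID, Word, KeywordSize) VALUES (?, ?, ?)",
        [(keyword_id, word, len(words)) for word in words],
    )
    return keyword_id

new_id = add_keyword_list(conn, "black angry cat")
rows = conn.execute(
    "SELECT KeywordID, Word, KeywordSize FROM ancillary ORDER BY id"
).fetchall()
print(new_id, rows)
```

With row ID 1701, the same code would store (1701, 'black', 3), (1701, 'angry', 3) and (1701, 'cat', 3).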
Now some time later we receive a sentence which is,
We could first run the query against a list of null-words to be removed, such as “is” and “and”. But this is not necessary.
Then we could run as many queries as there are words, and thereby discover that no rows anywhere contained “Schroedinger” and we can drop it. But this, too, is not necessary.
Finally we build the real query against ancillary:
The `WHERE` will return, say, these rows:
So the `GROUP BY` will return the `KeywordID` of these rows with its cardinality:
Now you can sort by matching ratio descending, and then by list size descending (since matching 100% of 3 words is better than matching 100% of 2, and matching 1 in 2 is better than matching 1 in 3):
You can also retrieve your first table in one query, with added match ratio:
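As a rough end-to-end illustration, the lookup could be written like this with Python's `sqlite3`. The schema, the sample rows, the test sentence and the names `find_matches` and `match_ratio` are all assumptions made for this sketch; the SQL itself should carry over to other engines with minor changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: three keyword lists and their exploded words.
conn.executescript("""
    CREATE TABLE keywords (id INTEGER PRIMARY KEY, keyWords TEXT);
    CREATE TABLE ancillary (KeywordID INTEGER, Word TEXT, KeywordSize INTEGER);
    CREATE INDEX ancillary_word ON ancillary (Word);
    INSERT INTO keywords VALUES
        (1, 'blue cat'), (2, 'black cat'), (1701, 'black angry cat');
    INSERT INTO ancillary VALUES
        (1, 'blue', 2), (1, 'cat', 2),
        (2, 'black', 2), (2, 'cat', 2),
        (1701, 'black', 3), (1701, 'angry', 3), (1701, 'cat', 3);
""")

def find_matches(conn, sentence):
    """Return (id, keyWords, hits, size, ratio) rows, best match first."""
    words = sorted(set(sentence.lower().split()))
    if not words:
        return []
    placeholders = ", ".join("?" for _ in words)
    sql = f"""
        SELECT k.id, k.keyWords, COUNT(*) AS hits, a.KeywordSize,
               COUNT(*) * 1.0 / a.KeywordSize AS match_ratio
        FROM ancillary AS a
        JOIN keywords AS k ON k.id = a.KeywordID
        WHERE a.Word IN ({placeholders})          -- exact, indexable match
        GROUP BY k.id, k.keyWords, a.KeywordSize
        ORDER BY match_ratio DESC, a.KeywordSize DESC
    """
    return conn.execute(sql, words).fetchall()

result = find_matches(conn, "Schroedinger cat is black and angry")
for row in result:
    print(row)
```

Here row 1701 comes first (3 of 3 words), then row 2 (2 of 2), then row 1 (1 of 2). Words such as “is” or “Schroedinger” simply match nothing, which is why removing them beforehand is optional.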
The largest cost is for the exact match in “ancillary”, which has to be indexed on the `Word` column.