There’s a lot of software that will take a search string and find all

Asked: June 1, 2026 (2026-06-01T15:41:46+00:00)

There’s a lot of software that will take a search string and find all of the text in your database that contains it (MySQL’s WHERE MATCH(string_column) AGAINST('searchterm'), Google, etc.), but is there a good algorithm for going the other way?

Say I have a list of search terms:

Toyota Prius, Toyota Tacoma, Honda Civic, Chevy Nova, Chevy Volt

And I have a string, like:

1962 Chevy Nova convertable

Is there a good algorithm where I can put the list and the string in, and get Chevy Nova out?

If they’re all easily tokenized, I could tokenize them and do an inner join, but I’m interested in the case where I can’t tell which part of the input string is the “important” part.

1 Answer

Editorial Team · Answer 1 · 2026-06-01T15:41:47+00:00

if you tokenize the input “1962 Chevy Nova convertable” [sic], you end up with four tokens, any of which might be the important one. if you keep a dictionary of every word you know about, each of those tokens has an index (an id) in that dictionary.

on the other hand, you’ve got your search terms. you tokenize and index those the same way, so each of the search terms here can be thought of as a pair of token indexes.

then matching the input against the search terms comes down to asking: which search terms contain any of the words in the input, and how many of them?
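outside of a database, the same idea is usually called an inverted index. here is a minimal Python sketch of it, assuming whitespace tokenization and case-insensitive matching (both are my assumptions, not something the question specifies); the names `tokenize`, `build_index` and `count_matches` are made up for illustration:

```python
from collections import Counter, defaultdict

SEARCH_TERMS = ["Toyota Prius", "Toyota Tacoma", "Honda Civic",
                "Chevy Nova", "Chevy Volt"]

def tokenize(text):
    """Naive tokenizer (an assumption): lowercase, split on whitespace."""
    return text.lower().split()

def build_index(terms):
    """Map each token to the set of search terms that contain it."""
    index = defaultdict(set)
    for term in terms:
        for token in tokenize(term):
            index[token].add(term)
    return index

def count_matches(index, text):
    """For each search term, count how many distinct input tokens it shares."""
    counts = Counter()
    for token in set(tokenize(text)):
        # .get() avoids creating empty entries for unknown words
        for term in index.get(token, ()):
            counts[term] += 1
    return counts

index = build_index(SEARCH_TERMS)
counts = count_matches(index, "1962 Chevy Nova convertable")
print(counts.most_common())  # [('Chevy Nova', 2), ('Chevy Volt', 1)]
```

the lookup cost depends on the number of words in the input, not on the number of search terms, which is the main reason to index the terms rather than scan the input for each one.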

since I’m a database guy at heart, I can imagine creating the token list like so:

CREATE TABLE aa_tokens (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  word VARCHAR(40) NOT NULL
);

insert into aa_tokens (word) values
  ('1962'),           -- 1
  ('Chevy'),          -- 2
  ('Civic'),          -- 3
  ('Honda'),          -- 4
  ('Nova'),           -- 5
  ('Prius'),          -- 6
  ('Tacoma'),         -- 7
  ('Toyota'),         -- 8
  ('Volt'),           -- 9
  ('convertable');    -- 10

and a table of searches so that each can have an id:

CREATE TABLE aa_search (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  text VARCHAR(255) NOT NULL
);

insert into aa_search (text) values
  ('Toyota Prius'),   -- 1
  ('Toyota Tacoma'),  -- 2
  ('Honda Civic'),    -- 3
  ('Chevy Nova'),     -- 4
  ('Chevy Volt');     -- 5

and then a table combining the searches and tokens:

CREATE TABLE aa_searchToks (
  search INT NOT NULL,
  token INT NOT NULL
);

insert into aa_searchToks (search, token) values
  (1, 8),
  (1, 6),
  (2, 8),
  (2, 7),
  (3, 4),
  (3, 3),
  (4, 2),
  (4, 5),
  (5, 2),
  (5, 9);

now if we take the input string “1962 Chevy Nova convertable” and turn it into token ids (1, 2, 5, 10), we can query for the search terms that share any of those tokens:

select search, count(*) from aa_searchToks
  where token in (1, 2, 5, 10) group by search
  order by count(*) desc;  -- best match first

the result of which is:

+--------+----------+
| search | count(*) |
+--------+----------+
|      4 |        2 |
|      5 |        1 |
+--------+----------+
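the step from the input string to the ids (1, 2, 5, 10) is not shown above. a rough Python sketch of it follows, using an in-memory SQLite database instead of MySQL so that it runs on its own (that swap, the simplified column types, and the helper names `token_ids` and `matching_searches` are all my assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE aa_tokens (id INTEGER PRIMARY KEY, word TEXT NOT NULL);
    CREATE TABLE aa_searchToks (search INTEGER NOT NULL, token INTEGER NOT NULL);
""")
# Same rows as the tables above; ids are assigned 1..10 in insertion order.
words = ["1962", "Chevy", "Civic", "Honda", "Nova",
         "Prius", "Tacoma", "Toyota", "Volt", "convertable"]
conn.executemany("INSERT INTO aa_tokens (word) VALUES (?)", [(w,) for w in words])
pairs = [(1, 8), (1, 6), (2, 8), (2, 7), (3, 4),
         (3, 3), (4, 2), (4, 5), (5, 2), (5, 9)]
conn.executemany("INSERT INTO aa_searchToks (search, token) VALUES (?, ?)", pairs)

def token_ids(conn, text):
    """Look up the id of each whitespace-separated word; unknown words are skipped."""
    ids = []
    for word in text.split():
        row = conn.execute(
            "SELECT id FROM aa_tokens WHERE word = ?", (word,)
        ).fetchone()
        if row is not None:
            ids.append(row[0])
    return ids

def matching_searches(conn, ids):
    """Run the grouped count query with one bound placeholder per token id."""
    if not ids:
        return []
    placeholders = ", ".join("?" for _ in ids)
    sql = (
        "SELECT search, COUNT(*) AS hits FROM aa_searchToks "
        f"WHERE token IN ({placeholders}) GROUP BY search "
        "ORDER BY hits DESC, search"
    )
    return conn.execute(sql, ids).fetchall()

ids = token_ids(conn, "1962 Chevy Nova convertable")
print(ids)                           # [1, 2, 5, 10]
print(matching_searches(conn, ids))  # [(4, 2), (5, 1)]
```

only the placeholders are formatted into the SQL string; the ids themselves are passed as bound parameters.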

or querying a little bit differently:

select st.search, s.text, count(*)
  from aa_searchToks st join aa_search s on s.id = st.search
  where st.token in (1, 2, 5, 10)
  group by st.search, s.text order by count(*) desc;

resulting in:

+--------+------------+----------+
| search | text       | count(*) |
+--------+------------+----------+
|      4 | Chevy Nova |        2 |
|      5 | Chevy Volt |        1 |
+--------+------------+----------+

we can see that “Chevy Nova” matches two of the input tokens, more than any other search term, so it comes out as the best match, which, of course, it is.
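one caveat with a raw count is that a long search term can outscore a short one that matches completely. a possible refinement, which is my own suggestion and not part of the queries above, is to score each term by the share of its own words found in the input. a small Python sketch, with `best_match` as a hypothetical helper and the same naive lowercase, whitespace tokenization assumed:

```python
def best_match(search_terms, text):
    """Return (term, coverage) for the term with the largest share of its
    own words present in text, or None if no term shares any word."""
    input_tokens = set(text.lower().split())
    best = None
    for term in search_terms:
        term_tokens = set(term.lower().split())
        if not term_tokens:
            continue  # ignore empty search terms
        coverage = len(term_tokens & input_tokens) / len(term_tokens)
        if coverage > 0 and (best is None or coverage > best[1]):
            best = (term, coverage)
    return best

terms = ["Toyota Prius", "Toyota Tacoma", "Honda Civic",
         "Chevy Nova", "Chevy Volt"]
print(best_match(terms, "1962 Chevy Nova convertable"))  # ('Chevy Nova', 1.0)
```

a coverage of 1.0 means every word of the search term appears in the input; ties go to whichever term comes first in the list.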
