Table1
- ID INT
- Tags TEXT
Table2
- ID INT
- Tags TEXT
The Tags field is space-delimited and could look like this: "abc def hij 123". Each record could have more than 200 tags. (Tags are defined "on the fly".)
Given a record from Table1, I want to find the "most suitable" record from Table2, where the tags in the Table1 row match tags in the Table2 row.
MySQL full-text search seems like the best thing to use for this.
Table2 should only have around 800-1000 rows – so not much overhead there. But Table1 might have 20 million, and I may want to, in the future, do the reverse (find best match from table1 for a row in table2).
Question:
Do you think full-text search is the best thing to use here? If not, what could be an alternative?
I have looked into XML databases, and they look promising (especially Xbase)… but do I feel confident putting that database live on a production machine? Not yet… (or should I?)
Full-text search will not help you because neither your needle nor your haystack is normalized. If you had only a single tag (the needle) to search for in a de-normalized list (the haystack), FTS could help you. Instead, you first need to normalize the list of search tags into a set of separate needles, then search for each one in the haystack.
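To make the "separate needles" point concrete, here is a minimal Python sketch of that per-pair work done outside the database. The function names `count_shared_tags` and `best_match` are made up for illustration, and scoring by the number of shared tags is an assumption about what "most suitable" means.

```python
def count_shared_tags(needle_tags, haystack_tags):
    """Count distinct tags present in both space-delimited strings."""
    needles = set(needle_tags.split())    # normalize the needle list
    haystack = set(haystack_tags.split())  # normalize the haystack list
    return len(needles & haystack)


def best_match(needle_tags, candidates):
    """Return (id, shared_count) for the candidate sharing the most tags.

    `candidates` maps a record ID to its space-delimited tag string.
    On a tie, the first candidate in iteration order wins.
    """
    scores = {cid: count_shared_tags(needle_tags, tags)
              for cid, tags in candidates.items()}
    best_id = max(scores, key=scores.get)
    return best_id, scores[best_id]
```

Every comparison has to split both strings again, which is the cost that storing the tags already split would avoid.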
You’re much better off just normalizing the data in the first place (separate tag tables of the form (ID, Tag)) and using JOIN to determine how many points of commonality there are.
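A minimal sketch of the normalized layout, assuming two tag tables named `Table1Tags` and `Table2Tags` (hypothetical names) and using Python's built-in `sqlite3` as a stand-in so the example runs on its own. In MySQL the `SELECT` should carry over largely unchanged, but `INSERT OR IGNORE` is SQLite syntax (MySQL spells it `INSERT IGNORE`), and an indexed tag column would need to be a `VARCHAR` or have a prefix length.

```python
import sqlite3


def make_db():
    """Create an in-memory database with one (ID, Tag) table per source table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Table1Tags (ID INTEGER NOT NULL, Tag TEXT NOT NULL,
                                 PRIMARY KEY (ID, Tag));
        CREATE TABLE Table2Tags (ID INTEGER NOT NULL, Tag TEXT NOT NULL,
                                 PRIMARY KEY (ID, Tag));
        CREATE INDEX Table2Tags_Tag ON Table2Tags (Tag);
    """)
    return conn


def add_tags(conn, table, item_id, tags):
    """Split a space-delimited tag string into one row per tag."""
    if table not in ("Table1Tags", "Table2Tags"):  # guard the interpolated name
        raise ValueError("unknown table: " + table)
    rows = [(item_id, tag) for tag in set(tags.split())]
    conn.executemany(
        "INSERT OR IGNORE INTO " + table + " (ID, Tag) VALUES (?, ?)", rows)


def best_table2_match(conn, table1_id):
    """Return (Table2 ID, shared tag count) for the best match, or None."""
    return conn.execute("""
        SELECT t2.ID, COUNT(*) AS shared
        FROM Table1Tags AS t1
        JOIN Table2Tags AS t2 ON t2.Tag = t1.Tag
        WHERE t1.ID = ?
        GROUP BY t2.ID
        ORDER BY shared DESC, t2.ID
        LIMIT 1
    """, (table1_id,)).fetchone()
```

Swapping the two table names in the query gives the reverse lookup (best Table1 match for a Table2 row), in which case an index on `Table1Tags (Tag)` would likely matter given the row counts mentioned in the question.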
On further consideration, I would suggest a single TaggedItems table with a structure like this:
(TAG TEXT(3), ID1 INTEGER, ID2 INTEGER). When you want to tag a Table1 record, you would issue an INSERT OR UPDATE (or the MySQL equivalent) for the tag and the ID1 column. Do the same for Table2 and the ID2 column. You can then retrieve a similarity rating by selecting the count of records in this table where ID1 is the value you're interested in and ID2 is NOT NULL, grouped by ID2. No JOIN required.