Every few minutes, around 500 paragraphs are submitted to a database table called “Content” (this number is expected to grow to over 2,500 in a few months).
There is another table called “Keywords” which has over 4,000 rows (and is expected to grow to over 10,000).
Keywords
+------------+-------------------+
| Keyword_id | keyword           |
+------------+-------------------+
| 1          | "Venture Capital" |
| 2          | "Financing"       |
+------------+-------------------+
The question is: what is the best way to scale a solution in which every keyword is checked against each incoming paragraph of text to see if there is a match?
I’m not concerned about where in the paragraph a match occurs, only that there IS a match, so if(preg_match()){} could possibly work. But even at the low end (500 paragraphs × 4,000 keywords), that’s 2,000,000 passes over a paragraph searching for a keyword.
Plus, correct me if I’m wrong, preg_match is pretty expensive.
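Since only the presence of a match matters, a plain substring test may be enough and avoids the regex engine entirely (the PHP analogue would be stripos rather than preg_match). A minimal Python sketch, assuming the keywords are already in memory as an id-to-text mapping; the function name and the dict shape are made up for illustration:

```python
def matching_keyword_ids(paragraph, keywords):
    """Return the ids of all keywords that occur anywhere in the paragraph.

    `keywords` maps Keyword_id -> keyword text (a hypothetical in-memory shape).
    Matching is case-insensitive and position is ignored.
    """
    haystack = paragraph.casefold()
    return {kid for kid, kw in keywords.items() if kw.casefold() in haystack}


keywords = {1: "Venture Capital", 2: "Financing"}
paragraph = "The startup closed a round of venture capital last week."
print(matching_keyword_ids(paragraph, keywords))
```

Note that a substring test also matches inside longer words ("Financing" would match "refinancing"), so this only fits if that behaviour is acceptable.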
One of the possibilities that crossed my mind was to keep an array of the keywords in the cache instead of having to query the DB for every row.
That would definitely help speed things up I think.
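The caching idea could look something like the sketch below: load the keyword list once and only go back to the DB after a time-to-live expires. The class name, the 300-second TTL, and the loader function are all assumptions for illustration; the loader stands in for a real query against the Keywords table.

```python
import time


class KeywordCache:
    """Keep the keyword list in memory and reload it only after `ttl` seconds."""

    def __init__(self, loader, ttl=300.0, clock=time.monotonic):
        self._loader = loader    # callable returning {Keyword_id: keyword}
        self._ttl = ttl          # seconds before the cached copy is considered stale
        self._clock = clock      # injectable so the expiry logic can be tested
        self._data = None
        self._loaded_at = 0.0

    def get(self):
        now = self._clock()
        if self._data is None or now - self._loaded_at >= self._ttl:
            self._data = dict(self._loader())
            self._loaded_at = now
        return self._data


def load_keywords_from_db():
    # Stand-in for a real "SELECT Keyword_id, keyword FROM Keywords" query.
    return {1: "Venture Capital", 2: "Financing"}


cache = KeywordCache(load_keywords_from_db, ttl=300.0)
keywords = cache.get()  # first call invokes the loader
keywords = cache.get()  # calls within the TTL reuse the in-memory copy
```

With around 4,000 to 10,000 short rows the whole table fits comfortably in memory, so one load per batch (or per TTL window) replaces a DB round trip per row.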
I’m not concerned with this being only in PHP.
If this section of the application needs to be in Python (correct me if I’m wrong, but I hear Python is a lot less expensive at parsing text), then I’m all ears.
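If Python is an option, one way to keep the work from growing with the number of keywords is to flip the lookup around: split each paragraph into words, then check every run of 1..N consecutive words against a dictionary of keywords. This is only a sketch under assumptions (whole-word matching is wanted, keywords are short phrases, and all function names are made up here):

```python
import re

WORD_RE = re.compile(r"\w+")


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return WORD_RE.findall(text.casefold())


def build_index(keywords):
    """Map each keyword's token tuple to its id; also return the longest phrase length.

    If two keywords tokenize identically, the later id wins.
    """
    index = {}
    for kid, kw in keywords.items():
        tokens = tuple(tokenize(kw))
        if tokens:
            index[tokens] = kid
    max_len = max((len(t) for t in index), default=0)
    return index, max_len


def match_paragraph(paragraph, index, max_len):
    """Return ids of keywords that appear as whole-word phrases in the paragraph."""
    words = tokenize(paragraph)
    found = set()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            kid = index.get(tuple(words[i:i + n]))
            if kid is not None:
                found.add(kid)
    return found


index, max_len = build_index({1: "Venture Capital", 2: "Financing"})
text = "Financing came from a venture capital firm; refinancing was ruled out."
print(match_paragraph(text, index, max_len))
```

Each paragraph costs roughly (number of words) × (longest keyword length in words) dictionary lookups, regardless of whether there are 4,000 or 10,000 keywords.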
With MySQL:
Search query:
Vent Capit
Using match against:
If you’re using a _ci collation (ci stands for case insensitive), the matching will ignore capitalization 🙂
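A sketch of what a MATCH ... AGAINST lookup for a partial search such as "Vent Capit" might look like when driven from Python. This assumes a FULLTEXT index on Keywords.keyword and a MySQL driver that uses %s placeholders; the helper name is made up, and the statement is only built here, not executed:

```python
import re


def boolean_prefix_query(search):
    """Turn 'Vent Capit' into '+Vent* +Capit*' for AGAINST(... IN BOOLEAN MODE).

    In boolean mode, '+' requires the word and a trailing '*' matches any
    word starting with that prefix.
    """
    words = re.findall(r"\w+", search)
    return " ".join(f"+{w}*" for w in words)


# Parameterized statement; assumes a FULLTEXT index on Keywords.keyword.
SQL = (
    "SELECT Keyword_id, keyword FROM Keywords "
    "WHERE MATCH(keyword) AGAINST (%s IN BOOLEAN MODE)"
)

params = (boolean_prefix_query("Vent Capit"),)
print(SQL, params)
```

Passing the search string as a parameter rather than concatenating it into the SQL keeps user input out of the statement text. Very short words may be ignored by the full-text index depending on the server's minimum token length setting.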