I have a database table with around 1000 keywords/phrases (one to four words long). This table changes rarely, so I could extract the data into something more useful (like a regular expression?). To be clear, this is not about finding or guessing at keywords using natural language processing.
I then have a user inputting some text into a form that I’d like to match against my keywords and phrases.
The program would then store a link to each phrase matched next to the text.
So if we ran the algorithm on this question text against a few phrases that are in here, we’d get a result like so:
{"inputting some text" : 1,
"extract the data" : 1,
"a phrase not here" : 0}
What are my options?
- Compile a regular expression
- Some sort of SQL query
- A third way?
Bear in mind that there are around 1000 possible phrases.
I’m running Django / Python with MySQL.
edit: I’m currently doing this:
>>> import re
>>> text_input = "This is something with first phrase in and third phrase"
>>> regex = "first phrase|second phrase|third phrase"
>>> p = re.compile(regex, re.I)
>>> p.findall(text_input)
['first phrase', 'third phrase']
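A slightly more defensive version of the regex approach above might look like the sketch below. It assumes the phrases have already been loaded from the table into a plain Python list; the function names `build_matcher` and `match_phrases` are made up for illustration. Each phrase is passed through `re.escape` so that punctuation in a phrase is not read as regex syntax, and `\b` keeps a phrase from matching in the middle of a longer word.

```python
import re

def build_matcher(phrases):
    # Longest phrases first, so the alternation prefers the longer
    # phrase when two phrases share a prefix.
    ordered = sorted(phrases, key=len, reverse=True)
    # re.escape stops characters such as "." or "+" in a phrase
    # from being interpreted as regex syntax.
    pattern = r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)

def match_phrases(text, phrases):
    # Returns {phrase: 1 or 0}, in the shape shown in the question.
    matcher = build_matcher(phrases)
    found = {m.lower() for m in matcher.findall(text)}
    return {p: int(p.lower() in found) for p in phrases}

phrases = ["inputting some text", "extract the data", "a phrase not here"]
text = "I could extract the data. A user is inputting some text into a form."
result = match_phrases(text, phrases)
print(result)
```

Since the table rarely changes, the compiled pattern could be built once and cached rather than rebuilt per request. One caveat: `findall` returns non-overlapping matches only, so if two phrases overlap in the text, only the one that matches first is reported.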
The algorithm for this job is Aho-Corasick; see the link at the bottom, which points to a C extension for Python.
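To give an idea of what Aho-Corasick does, here is a minimal pure-Python sketch. It is not the C extension mentioned above and will be slower; the names `build_automaton` and `find_phrases` are hypothetical. The phrases are put into a trie, each node gets a failure link to the longest proper suffix that is also a prefix of some phrase, and the text is then scanned once regardless of how many phrases there are.

```python
from collections import deque

def build_automaton(phrases):
    # Parallel lists indexed by node id; node 0 is the root.
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # failure link for each node
    out = [[]]    # phrases that end at each node
    for phrase in phrases:
        node = 0
        for ch in phrase.lower():
            nxt = goto[node].get(ch)
            if nxt is None:
                goto.append({})
                fail.append(0)
                out.append([])
                nxt = len(goto) - 1
                goto[node][ch] = nxt
            node = nxt
        out[node].append(phrase)
    # Breadth-first pass to fill in failure links.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit phrases that end at the failure target.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out

def find_phrases(text, automaton):
    # Single pass over the text; returns the set of phrases found.
    goto, fail, out = automaton
    node = 0
    found = set()
    for ch in text.lower():
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        found.update(out[node])
    return found

phrases = ["first phrase", "second phrase", "third phrase"]
text = "This is something with first phrase in and third phrase"
found = find_phrases(text, build_automaton(phrases))
result = {p: int(p in found) for p in phrases}
print(result)
```

Unlike the regex, this reports overlapping phrases too. Note that it matches raw substrings, so "extract the data" would also be found inside "extract the database"; if that matters, the characters on either side of each match would need checking.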