I’m trying to code up a natural language parser and search engine in PHP. Every approach I have thought of so far has been cumbersome to implement, cumbersome to use, or not very efficient.
One of my ideas was a script that runs regular expressions against a simplified string, i.e. the query with various filler words removed. The resulting string would be checked first for what the user is looking for (e.g. “opening times”), and then, if possible, for the venue they’re searching for (let’s say “Derngate”). The rest works along the same lines.
Can anyone point me in the direction of a more efficient way of doing things? I don’t want to be running 25 different regular expressions (or whatever the count turns out to be) on every page load if I can help it.
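For reference, the idea above could be sketched roughly as follows, using plain array lookups instead of one regular expression per phrase. The word lists and the function name parseQuery are made up for illustration; a real site would load its own stop words, phrases and venues.

```php
<?php
// Hypothetical word lists, for illustration only.
const STOP_WORDS = ['what', 'are', 'is', 'the', 'for', 'of', 'at', 'a', 'an', 'please'];
const INTENTS = ['opening times' => 'opening_times', 'ticket prices' => 'ticket_prices'];
const VENUES = ['derngate', 'royal'];

function parseQuery(string $query): array
{
    // Lowercase, then split on anything that is not a letter or digit.
    $words = preg_split('/[^a-z0-9]+/', strtolower($query), -1, PREG_SPLIT_NO_EMPTY);

    // Drop the stop words to get the simplified string.
    $kept = array_values(array_diff($words, STOP_WORDS));
    $simplified = ' ' . implode(' ', $kept) . ' ';

    // First look for what the user wants (matched on whole words only).
    $intent = null;
    foreach (INTENTS as $phrase => $name) {
        if (strpos($simplified, ' ' . $phrase . ' ') !== false) {
            $intent = $name;
            break;
        }
    }

    // Then look for a known venue among the remaining words.
    $venue = null;
    foreach ($kept as $word) {
        if (in_array($word, VENUES, true)) {
            $venue = $word;
            break;
        }
    }

    return ['intent' => $intent, 'venue' => $venue];
}
```

Calling parseQuery('What are the opening times for the Derngate?') gives the intent and the venue as two array entries. The query is tokenised with a single preg_split call, and everything after that is array and string lookups.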
Many thanks!
Edit: I’m just curious, that’s all. I’d rather build my own (to see how it works) than jump straight into something like Lucene.
I think that after a review of the state of the art, I’d look at root/stem word extraction as a start. (Not too heavy a task if your document corpus is relatively static, since this can be done at document-capture time.)
There’s a PHP extension for that, stem. http://pecl.php.net/package/stem
There are also implementations of the Porter Stemmer in plain PHP. That algorithm is the key operation in the extension above, packaged as a single function.
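To make the capture-time idea concrete, here is a minimal sketch of a stemmed inverted index. Everything in it is an assumption for illustration: naiveStem is a crude suffix stripper standing in for a real Porter stemmer (it is not the Porter algorithm), and buildIndex, search and the sample documents are invented names and data.

```php
<?php
// Stand-in stemmer: strips a few common English suffixes.
// NOT the Porter algorithm; swap in a real stemmer for actual use.
function naiveStem(string $word): string
{
    foreach (['ing', 'ed', 's'] as $suffix) {
        $len = strlen($suffix);
        if (strlen($word) > $len + 2 && substr($word, -$len) === $suffix) {
            return substr($word, 0, -$len);
        }
    }
    return $word;
}

// Lowercase, split into words, and stem each word.
function stemTokens(string $text): array
{
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_map('naiveStem', $words);
}

// Done once, at document-capture time: map each stem to the ids of
// the documents that contain it.
function buildIndex(array $documents): array
{
    $index = [];
    foreach ($documents as $id => $text) {
        foreach (array_unique(stemTokens($text)) as $stem) {
            $index[$stem][] = $id;
        }
    }
    return $index;
}

// Done per query: stem the query words and intersect their id lists,
// so only documents containing every query stem are returned.
function search(array $index, string $query): array
{
    $result = null;
    foreach (array_unique(stemTokens($query)) as $stem) {
        $ids = $index[$stem] ?? [];
        $result = $result === null ? $ids : array_values(array_intersect($result, $ids));
    }
    return $result ?? [];
}
```

With an index built from a few short venue descriptions, a query such as "Derngate opened" would match a document containing "Derngate opening times", because both "opened" and "opening" reduce to the same stem. The per-request work is then a handful of array lookups, with no regular expressions run against the documents themselves.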