What I am trying to implement is a rather trivial “take search results (as in title & short description), cluster them into meaningful named groups” program in PHP.
After hours of googling and countless searches on SO (yielding interesting results as always, albeit nothing really useful) I’m still unable to find any PHP library that would help me handle clustering.
- Is there such a PHP library out there that I might have missed?
- If not, is there any FOSS that handles clustering and has a decent API?
You could do it yourself, like this:
Use a list of stopwords, get all words or phrases not in the stopwords, count the occurrences of each, and sort in descending order.
The stopword list needs to contain all common English terms. It should also include punctuation, in which case you will first need to preg_replace the input so that each punctuation mark becomes a separate word, e.g. “Something, like this.” -> “Something , like this .” Alternatively, you can just remove all punctuation.
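Both punctuation options might look roughly like this. It is only a sketch: the function names `separatePunctuation` and `stripPunctuation` are made up for illustration, and the POSIX `[[:punct:]]` class only covers ASCII punctuation, so typographic quotes would need extra handling.

```php
<?php
// Option 1: turn every punctuation mark into its own space-separated token.
function separatePunctuation(string $text): string
{
    $spaced = preg_replace('/\s*([[:punct:]])\s*/', ' $1 ', $text);
    // Collapse the extra whitespace the replacement may have introduced.
    return trim(preg_replace('/\s+/', ' ', $spaced));
}

// Option 2: drop punctuation altogether.
function stripPunctuation(string $text): string
{
    $stripped = preg_replace('/[[:punct:]]+/', ' ', $text);
    return trim(preg_replace('/\s+/', ' ', $stripped));
}
```

With option 1, the punctuation marks themselves have to be in the stopword list; with option 2 they never reach it.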
Now you have an associative array in order of the frequency of terms that occur in your input data.
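The steps up to this point could be sketched as follows. The function name `termFrequencies` and the tiny stopword list in the usage note are placeholders, and this version takes the "remove all punctuation" route and handles single words only, not phrases.

```php
<?php
// Hypothetical helper: returns term => count, most frequent term first.
function termFrequencies(string $text, array $stopwords): array
{
    // Lowercase, then replace anything that is not a letter, digit or
    // whitespace with a space (ASCII only, which is enough for English).
    $clean = preg_replace('/[^a-z0-9\s]+/', ' ', strtolower($text));

    // Split on whitespace, skipping empty pieces.
    $words = preg_split('/\s+/', $clean, -1, PREG_SPLIT_NO_EMPTY);

    // Drop the stopwords, count what is left, sort by count descending.
    $counts = array_count_values(array_diff($words, $stopwords));
    arsort($counts);

    return $counts;
}
```

For example, `termFrequencies('The cat and the other cat.', ['the', 'and'])` gives `cat` a count of 2 and `other` a count of 1. A real stopword list would of course be much longer.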
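How you do the matching is up to you, and it depends largely on the length of the strings in the input data.
I would build this frequency array for each search result, then check whether any of the top 3 keys for one result match any of the top 3 keys for any other result. Results that share a term form a group, and the shared term can serve as the group's name.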
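One possible way to do that matching, assuming you already have one frequency array per result, sorted by count descending. The name `groupByTopTerms`, the use of the shared term as the group name, and the fact that a result can end up in more than one group are all choices made for this sketch (it also needs PHP 7.4+ for the arrow function).

```php
<?php
// $frequencies: result id => (term => count), each sorted by count descending.
// Returns: shared term => list of result ids that have it among their top terms.
function groupByTopTerms(array $frequencies, int $top = 3): array
{
    $groups = [];
    foreach ($frequencies as $id => $counts) {
        // Take only the most frequent terms of this result.
        foreach (array_slice(array_keys($counts), 0, $top) as $term) {
            $groups[$term][] = $id;
        }
    }

    // A term only forms a group if at least two results share it.
    return array_filter($groups, fn(array $ids) => count($ids) > 1);
}
```

Results that share no top term with anything else are left out here; you could collect them into an "other" group if you need every result placed somewhere.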
Let me know if you have any trouble with this.