I have a PHP/MySQL-driven site that I have not maintained for the past 6 months. It is a site where users submit their own articles. I have 50,000 articles, and based on some ‘ad hoc’ tests I would estimate that about 50-60% of them are spam or text copied and pasted from other sites.
I am looking to write a PHP script that takes some base parameters and marks or removes spam text (not copied text; for this step only pure spam). My idea is a script that goes through every article, counts characters, words, distinct words, phrase usage and word density, and, depending on those factors, removes the article as pure spam (for example, when it contains many repeated phrases). Writing this would cost me a whole day, so my questions are:
Is there some solution already developed in PHP?
If I need to code it myself, what parameters should I use to determine whether a text is spam?
Here’s a PHP class that I’ve used in the past: Basic Spam Class
I am not the author, so I don’t take any responsibility for potential damage done by the code. I have only used it for checking short texts (user comments on a site), so I’m not sure how it will perform on 50k long articles; you may need to enhance it. But at least you have something to start from.
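If you do end up coding it yourself, the metrics you describe (length, word count, distinct words, word density, repeated phrases) could be sketched roughly like this. This is not the class mentioned above, just a minimal illustration: the function names `spam_metrics` and `looks_like_spam`, the 3-word phrase length and all thresholds are my own assumptions, and you would need to tune them against articles you have already classified by hand.

```php
<?php
// Hypothetical helper: computes simple repetition metrics for one article.
function spam_metrics(string $text, int $phraseLen = 3): array
{
    // Lowercase (ASCII only) and split into words: letters, digits, apostrophes.
    preg_match_all("/[\\p{L}\\p{N}']+/u", strtolower($text), $m);
    $words = $m[0];
    $wordCount = count($words);

    $metrics = [
        'length'                => strlen($text), // byte length; mb_strlen() suits multibyte text better
        'words'                 => $wordCount,
        'unique_ratio'          => 0.0, // distinct words / all words
        'top_density'           => 0.0, // share of the single most frequent word
        'repeated_phrase_ratio' => 0.0, // share of phrases that occur more than once
    ];
    if ($wordCount === 0) {
        return $metrics;
    }

    $freq = array_count_values($words);
    $metrics['unique_ratio'] = count($freq) / $wordCount;
    $metrics['top_density']  = max($freq) / $wordCount;

    // Build overlapping phrases of $phraseLen consecutive words.
    $phrases = [];
    for ($i = 0; $i + $phraseLen <= $wordCount; $i++) {
        $phrases[] = implode(' ', array_slice($words, $i, $phraseLen));
    }
    if (count($phrases) > 0) {
        $repeated = 0;
        foreach (array_count_values($phrases) as $n) {
            if ($n > 1) {
                $repeated += $n;
            }
        }
        $metrics['repeated_phrase_ratio'] = $repeated / count($phrases);
    }
    return $metrics;
}

// Hypothetical decision rule; every threshold here is an assumption to be tuned.
function looks_like_spam(array $m, int $minWords = 10): bool
{
    if ($m['words'] < $minWords) {
        return false; // too short to judge by repetition alone
    }
    return $m['unique_ratio'] < 0.3 || $m['repeated_phrase_ratio'] > 0.5;
}
```

For 50k articles I would run something like this in batches, and only flag rows in the first pass (for example in an extra column) rather than delete them, so you can review a sample before removing anything.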