In short:
Java / Hibernate / AJAX / Spring MVC
I would like every comment posted by a user to be checked on the server side before it is stored in the database, and rejected if it contains offensive text.
The list of offensive texts is quite large (maybe thousands of entries). Look at this example list: http://onlineslangdictionary.com/lists/most-vulgar-words/
I suspect that iterating over this list and running something like the following for every comment is not very fast. Is there a faster way to do this filtering?
Do you think searching through thousands of items will have a big impact on CPU/RAM usage? Any suggestions are welcome!
    for (String offensiveText : offensiveTextList) {
        if (commentText.contains(offensiveText)) {
            // reject comment
        }
    }
Update:
An item in the offensive list can consist of several words (for example a three-word phrase, possibly including stop words).
It can also contain non-alphabetic characters such as *&^%.
If the comment contains an offensive item (exactly the same characters), the comment is rejected.
You would probably need to use some natural language processing library for this. If you compare each of the M words in a comment against each of the N offensive words in the list, the complexity of your algorithm is going to be
O(MN), which is effectively quadratic when M and N are of similar size and gets expensive with thousands of entries. Take a look at the Lucene stack; you may find some really good ideas there, for example how to tokenize a comment and reduce the input by removing meaningless words.
Also take a look at the thesis: “Distinguishing Between Factual Information and Insulting or Abusive Messages bearing Words or Phrases in News Articles”
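To make the tokenizing idea concrete, here is a minimal sketch in plain Java (no Lucene). It is not a definitive implementation: the class name OffensiveTextFilter and its methods are made up for illustration, and it assumes that matching on whitespace-separated words is acceptable, which is stricter than String.contains (a list entry inside a longer word will not match). The list is loaded once into a HashSet, and each comment is checked by looking up every run of up to maxWords consecutive words, so the cost per comment depends on the comment length and the longest phrase, not on the size of the list.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of any library.
public class OffensiveTextFilter {
    // Offensive items, normalized to words joined by single spaces.
    private final Set<String> phrases = new HashSet<>();
    // Word count of the longest item; bounds the phrase length we probe.
    private int maxWords = 1;

    public OffensiveTextFilter(List<String> offensiveTextList) {
        for (String item : offensiveTextList) {
            String[] words = tokenize(item);
            if (words.length == 0) {
                continue; // skip blank entries
            }
            phrases.add(String.join(" ", words));
            maxWords = Math.max(maxWords, words.length);
        }
    }

    // Splits on whitespace only, so entries like *&^% stay intact.
    // Matching is case-sensitive (assumption: "exactly same letters").
    private static String[] tokenize(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public boolean isOffensive(String commentText) {
        String[] words = tokenize(commentText);
        for (int start = 0; start < words.length; start++) {
            StringBuilder candidate = new StringBuilder();
            // Grow the candidate one word at a time, up to maxWords words.
            for (int len = 1; len <= maxWords && start + len <= words.length; len++) {
                if (len > 1) {
                    candidate.append(' ');
                }
                candidate.append(words[start + len - 1]);
                if (phrases.contains(candidate.toString())) {
                    return true; // reject comment
                }
            }
        }
        return false;
    }
}
```

Build the filter once at startup (for example as a singleton bean) and call isOffensive(commentText) for each posted comment. If you really need substring semantics like contains, a multi-pattern matcher such as Aho-Corasick would be the thing to look at instead; that is a different technique from the one sketched here.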