I’m making a profanity filter (bad idea I know), and I’m trying to do it with regex in Java.
Right now, here’s my example regex string; it would filter 2 words, foo and bar.
(?i)f(?>[.,;:*`'^+~\\/#|]*+|.)o(?>[.,;:*`'^+~\\/#|]*+|.)o|b(?>[.,;:*`'^+~\\/#|]*+|.)a(?>[.,;:*`'^+~\\/#|]*+|.)r
Basically, I have it ignore case, then I put (?>[.,;:*`'^+~\\/#|]*+|.) between each pair of letters in a curse word, and | between each complete curse word regex.
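For illustration, the construction described above could be generated from a word list rather than written by hand. This is only a sketch: the class and method names (RegexFilterBuilder, build, spread) are made up, and it assumes the blacklisted words contain only letters, so nothing needs escaping.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class RegexFilterBuilder {
    // Separator group from the regex above. The regex's "\\" (a literal
    // backslash) has to be written as "\\\\" inside a Java string literal.
    static final String SEP = "(?>[.,;:*`'^+~\\\\/#|]*+|.)";

    // Puts the separator group between each pair of letters of one word.
    // Assumption: the word contains only letters (no regex metacharacters).
    static String spread(String word) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            if (i > 0) {
                sb.append(SEP);
            }
            sb.append(word.charAt(i));
        }
        return sb.toString();
    }

    // Joins the per-word regexes with '|' and compiles once, ignoring case.
    static Pattern build(List<String> words) {
        String alternation = words.stream()
                .map(RegexFilterBuilder::spread)
                .collect(Collectors.joining("|"));
        return Pattern.compile("(?i)" + alternation);
    }
}
```

Calling RegexFilterBuilder.build(Arrays.asList("foo", "bar")) should produce the same pattern string as the one shown above, and the resulting Pattern can be kept in a static field so it is compiled only once.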
It works, but it’s sorta slow.
If I have 6 words in the filter, it will filter a fairly long string (500 characters) in 939,548 nanoseconds. When I have 12, it just about doubles.
So, about 1ms per 6 curse words with this. But my filter will have hundreds (400 or so).
Extrapolating linearly, it would take about 66ms to filter this long string with 400 words.
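A rough sketch of how a timing like this might be measured and extrapolated is below. FilterTiming and its methods are hypothetical names, the warm-up and run counts are left to the caller, and the extrapolation assumes cost grows linearly with the number of words, which is what the 6-word and 12-word measurements suggest but does not guarantee.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FilterTiming {
    // Runs the precompiled pattern over the whole text and counts the matches.
    static int countMatches(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Average nanoseconds per full pass, measured after a warm-up phase so
    // the JIT has a chance to compile the hot path before timing starts.
    static long averageNanos(Pattern pattern, String text, int warmup, int runs) {
        for (int i = 0; i < warmup; i++) {
            countMatches(pattern, text);
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            countMatches(pattern, text);
        }
        return (System.nanoTime() - start) / runs;
    }

    // Linear extrapolation from a measured word count to a target word count.
    static double estimateMillis(long nanosForSample, int sampleWords, int targetWords) {
        return (double) nanosForSample / sampleWords * targetWords / 1_000_000.0;
    }
}
```

With the measured figure, estimateMillis(939_548L, 6, 400) comes out at roughly 63ms; rounding to 1ms per 6 words gives the 66ms quoted above. Either way it is far over a 1ms budget.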
This is a chat server I’m building, and if I have lots of users on (say, 5,000) and 1 out of 5 are chatting in any given second (1,000 chat messages per second), I need to filter each message in about 1ms.
Am I asking too much of regexps? Would it be faster to make my own specialized type of filter by hand? Are there ways to optimize this?
I am precompiling the regex.
If you want to see the effect of this regex http://regexr.com?30454
Update: Another thing I could do is have chat messages filtered client side in ActionScript.
Update: I believe the only way to achieve such degree of performance would be a hand-coded solution without using regexps sadly, so I’ll have to do a more basic filter.
To answer your question, “Am I asking too much of regexps?”: yes.
I spent the better part of 2 years working on a profanity filter using regular expressions and finally gave up. During this time, I tried all of these things:
Nothing worked well, and as my blacklist grew, my system slowed down. In the end I gave up and implemented a linear analysis filter, which is now the core part of CleanSpeak, my company’s profanity filtering product.
We found that we were also able to do some great multi-threading and other optimizations once we stopped using regexps and went from handling 600-700 messages per second to 10,000+ messages per second.
Lastly, we also found that performing linear analysis made the filter more accurate and allowed us to solve the “Scunthorpe problem” and many of the other ones people have mentioned in the comments here.
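To give a feel for what a hand-coded, non-regex filter can look like, here is a minimal sketch. It is not CleanSpeak's algorithm, just one common approach: store the blacklist in a trie and walk the message character by character, skipping the same separator characters the regex in the question allows between letters. All names (LinearScanFilter, mask, matchEnd) are made up for the example, and it does nothing about the Scunthorpe problem.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LinearScanFilter {
    // Same separator characters as the character class in the question.
    private static final String SEPARATORS = ".,;:*`'^+~\\/#|";

    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean terminal; // true if a blacklisted word ends at this node
    }

    private final Node root = new Node();

    // Builds the trie once; each lookup afterwards walks it per character,
    // so the number of blacklisted words does not multiply the scan cost.
    LinearScanFilter(List<String> words) {
        for (String w : words) {
            Node n = root;
            for (char c : w.toLowerCase().toCharArray()) {
                n = n.next.computeIfAbsent(c, k -> new Node());
            }
            n.terminal = true;
        }
    }

    // Returns the text with every blacklisted span replaced by '*'.
    String mask(String text) {
        char[] out = text.toCharArray();
        int i = 0;
        while (i < out.length) {
            int end = matchEnd(text, i);
            if (end < 0) {
                i++;
                continue;
            }
            for (int j = i; j < end; j++) {
                out[j] = '*';
            }
            i = end;
        }
        return new String(out);
    }

    // End index (exclusive) of the longest blacklisted word that starts
    // at 'start', or -1 if none starts there.
    private int matchEnd(String text, int start) {
        Node n = root;
        int best = -1;
        for (int i = start; i < text.length(); i++) {
            char c = Character.toLowerCase(text.charAt(i));
            Node child = n.next.get(c);
            if (child == null) {
                // Separators are skipped only once a word has been started.
                if (n != root && SEPARATORS.indexOf(c) >= 0) {
                    continue;
                }
                break;
            }
            n = child;
            if (n.terminal) {
                best = i + 1;
            }
        }
        return best;
    }
}
```

For example, with a blacklist of "foo" and "bar", mask("a f.o.o b") gives "a ***** b". A filter built this way has no mutable state after construction, so one instance can be shared across threads.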
You can definitely try all of the things I mention above and see if you can get your performance up, but it is a hard problem to solve because regexps weren’t really designed for language analysis. They were designed for text analysis, which is a very different problem.