I’m building an email filter and I need a way to efficiently match a single email to a large number of filters/rules. The email can be matched on any of the following fields:
- From name
- From address
- Sender name
- Sender address
- Subject
- Message body
Presently there are over 5000 filters (and growing) which are all defined in a single table in our PostgreSQL (9.1) database. Each filter may have 1 or more of the above fields populated with a Python regular expression.
Currently we select all filters and load them into memory, then iterate over them for each email until one filter matches on all of its non-blank fields. Unfortunately this means that a single email can require as many as 30,000 (5000 x 6) re.match operations. Clearly this won’t scale as more filters get added (in fact it already doesn’t).
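For reference, the loop described above looks roughly like the following. This is a minimal sketch, not our actual code: the column names in `FIELDS`, the `id` key, and the dict-shaped rows are assumptions made for illustration. It compiles each pattern once and stops checking a filter at the first field that fails, which trims the constant factor but is still a linear scan over every filter.

```python
import re

# Hypothetical column names for the six matchable fields.
FIELDS = ("from_name", "from_address", "sender_name",
          "sender_address", "subject", "body")


def compile_filters(rows):
    """Compile the non-blank patterns of each filter row once, up front."""
    compiled = []
    for row in rows:
        patterns = {f: re.compile(row[f]) for f in FIELDS if row.get(f)}
        if patterns:
            compiled.append((row["id"], patterns))
    return compiled


def first_match(email, compiled):
    """Return the id of the first filter whose populated fields all match."""
    for filter_id, patterns in compiled:
        # all() short-circuits, so a filter is dropped at its first failing field.
        if all(p.match(email.get(f) or "") for f, p in patterns.items()):
            return filter_id
    return None
```

Even with these tweaks the worst case remains one pass over all filters per email, which is the part that needs to change.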
Is there a better way to do this?
Options I’ve considered so far:
- Converting the saved Python regular expressions to POSIX-style ones to make use of PostgreSQL’s SIMILAR TO expression. Will this really be any quicker? It seems to me like it’s simply shifting the load somewhere else.
- Defining filters on a per-user basis. This isn’t really practical, though, because in our system users actually benefit from a wealth of predefined filters.
- Switching to a document-based search engine like Elasticsearch, where the first email to be filtered is saved as the canonical representation. By finding similar emails we can then narrow down to a specific feature set to test on and get a positive match.
- Switching to a Bayesian filter, which would also give us some machine learning capability to detect similar emails, or changes to existing emails, that would still match with a high enough probability to guess that they were the same thing. This sounds cool, but I’m not sure it would scale particularly well either.
Are there other options or approaches to consider?
The trigram support in PostgreSQL version 9.1 might give you what you want.
http://www.postgresql.org/docs/9.1/interactive/pgtrgm.html
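To give a feel for what the module works with, here is a rough Python approximation of how pg_trgm splits text into trigrams and scores similarity. It is a sketch under assumptions: the function names are mine, it handles only ASCII letters and digits, and the real extension runs inside the database in C, so results may differ on edge cases.

```python
import re


def trigrams(text):
    """Approximate pg_trgm: lowercase the text, split it into alphanumeric
    words, and pad each word with two leading spaces and one trailing space."""
    result = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams in either string."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

For example, `trigrams("cat")` yields the four trigrams `"  c"`, `" ca"`, `"cat"` and `"at "`, which is the example given in the pg_trgm documentation.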
It should become a much stronger solution once trigram indexes can be used for regular-expression matching. 9.1 uses them for similarity, LIKE, and ILIKE searches; index support for the regex operators (~ and ~*) was added to pg_trgm in PostgreSQL 9.3. At our shop we have found the speed of trigram indexes to be very good.
Also, if you ever want to do a “nearest neighbor” search, where you find the K best matches based on similarity to a search argument, a trigram index is wonderful: it actually returns rows from the index scan in order of “distance”. Search for KNN-GiST for write-ups.
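To illustrate what "in order of distance" means, here is a small in-memory Python sketch of a K-nearest search by trigram distance, taken as one minus the similarity (the same definition pg_trgm uses for its `<->` operator). The helper names and sample subjects are made up for the example, and the trigram splitting is only an ASCII approximation of pg_trgm. Unlike the GiST index, this version scans every candidate; the point of the index is that it avoids doing so.

```python
import heapq
import re


def _trigrams(text):
    """ASCII-only approximation of pg_trgm's trigram extraction."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def distance(a, b):
    """One minus the trigram similarity; 0.0 means identical trigram sets."""
    ta, tb = _trigrams(a), _trigrams(b)
    if not ta or not tb:
        return 1.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def k_nearest(query, candidates, k):
    """Return the k candidates closest to query, nearest first."""
    return heapq.nsmallest(k, candidates, key=lambda c: distance(query, c))


subjects = ["Your invoice 1042", "Weekly newsletter",
            "Invoice 1043 attached", "Password reset"]
nearest = k_nearest("invoice 1042", subjects, 2)
```

Here `nearest` holds the two invoice subjects, closest first; the other two share no trigrams with the query and sit at distance 1.0.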