This is a design question.
Background: We get a web request into our system from many different websites (for a widget that we give out), from which we grab the referrer string (if it exists). We use the referrer to decide on some things within the application. The problem arises in that I need to look at a list of “sites” (urls, partial urls, urls containing wildcards) in order to determine what to do. This list could be on the order of many thousands of sites. I need to be able to ask something like a “Site Service” (or whatever) if the referrer is a match with anything in the site list. I need to do this fast, say 5-10ms, give or take a few ms, and get a positive or negative result back.
Here is a basic example:
Request – Referrer = http://www.stackoverflow.com/users/120262?tab=accounts
Site List Could Contain urls like:
users.stackoverflow.com (not a match)
www.stackoverflow.com/users (match)
www.stackoverflow.com/users/120262 (match)
www.stackoverflow.com/users/* (match)
*/users/* (match)
www.stackoverflow.com/users/239289 (not a match)
*.stackoverflow.com/questions/ask (not a match)
*/questions/* (not a match)
www.stackoverflow.com (match)
www.msdn.com (not a match)
*.msdn.com (not a match)
developer.*.com (not a match)
You get the idea…
The issue I am dealing with is how to handle this in a performant and scalable way.
Performant in that I need to make a decision fast so that I can move on to the real processing that needs to happen.
Scalable in that a list of thousands of “sites” is set up for each affiliate that we have, and each affiliate may have many site lists, making for thousands of site lists containing thousands of sites.
I’m willing to consider pretty much anything here as I am just in the initial (re)design phase of this. Any and all thoughts are welcome including solution suggestions, general patterns to look into, existing tools even.
Thanks.
This is a partial answer. It assumes that the patterns you are trying to match against are all either constant strings with no wildcards in them, or a sequence of constant strings separated by wildcards “*” that can match any string.
This problem has been studied quite a bit in the context of implementing network-based and host-based intrusion detection systems, where you have a bunch of patterns you are looking for in network traffic, where each pattern might be a sign of an intruder sending attack traffic at you.
In the special case where there are no wildcards at all in the patterns, and your set of patterns is changing infrequently, so you can afford to spend some time doing some precomputation of data structures when they change, a well-known way to do this is the Aho-Corasick algorithm:
http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
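To make the precomputation step concrete, here is a minimal sketch of Aho-Corasick in Python. The class name AhoCorasick and its method names are my own for illustration, not from any library. It builds the trie and failure links once, then scans each incoming referrer in a single pass and reports every constant pattern that occurs anywhere in it.

```python
from collections import deque


class AhoCorasick:
    """Multi-pattern substring matcher: built once, queried many times."""

    def __init__(self, patterns):
        self.goto = [{}]    # goto[state][char] -> next state (the trie)
        self.fail = [0]     # failure link for each state
        self.out = [set()]  # patterns recognized on reaching each state
        for pattern in patterns:
            self._add(pattern)
        self._build_failure_links()

    def _add(self, pattern):
        state = 0
        for ch in pattern:
            nxt = self.goto[state].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto.append({})
                self.fail.append(0)
                self.out.append(set())
                self.goto[state][ch] = nxt
            state = nxt
        self.out[state].add(pattern)

    def _build_failure_links(self):
        # Breadth-first, so a state's failure target is always finished first.
        queue = deque(self.goto[0].values())  # depth-1 states fail to the root
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit patterns that end at the failure target.
                self.out[nxt] |= self.out[self.fail[nxt]]

    def find(self, text):
        """Return the set of patterns that occur anywhere in text."""
        state, found = 0, set()
        for ch in text:
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            found |= self.out[state]
        return found


sites = ["users.stackoverflow.com", "www.stackoverflow.com/users",
         "www.stackoverflow.com", "www.msdn.com"]
matcher = AhoCorasick(sites)  # rebuild only when the site list changes
print(sorted(matcher.find("www.stackoverflow.com/users/120262?tab=accounts")))
# ['www.stackoverflow.com', 'www.stackoverflow.com/users']
```

Note that this reports occurrences anywhere in the string. If your site entries must match from the start of the host name, as your examples suggest, you would need an extra position check on each hit; that part is not shown here.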
If you then want to generalize to allow wildcards, the following idea might not have good worst-case performance, but would likely perform well in practice. Break up each pattern that has wildcards in it into its “constant string” parts, e.g. break up the pattern “developer.*.com” into “developer.” and “.com”. Put those two strings separately into the list of strings you are searching for. Only if an incoming URL matches both “developer.” and “.com” would you then do some post-processing to make sure it contains them in the desired order. A URL like “a.com.developer.foo” contains both parts, but in the opposite order, and so should not match the pattern “developer.*.com”.
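Here is a rough sketch of that two-stage check in Python. The function names are hypothetical, and the first stage uses plain substring tests as a stand-in for the multi-pattern search; in a real system the constant parts of all patterns would go into one Aho-Corasick automaton. It also assumes unanchored matching, i.e. a pattern may match anywhere in the URL.

```python
def literal_parts(pattern):
    """Split a wildcard pattern into its non-empty constant pieces."""
    return [part for part in pattern.split("*") if part]


def parts_in_order(parts, text):
    """True if every part occurs in text, left to right, without overlap."""
    pos = 0
    for part in parts:
        idx = text.find(part, pos)
        if idx < 0:
            return False
        pos = idx + len(part)  # the next part must start after this one
    return True


def matching_patterns(patterns, text):
    """Return the patterns whose constant parts all occur in text, in order."""
    matched = []
    for pattern in patterns:
        parts = literal_parts(pattern)
        # Stage 1 (stand-in for the multi-pattern search): are all parts present?
        if not all(part in text for part in parts):
            continue
        # Stage 2: post-process to confirm the parts appear in pattern order.
        if parts_in_order(parts, text):
            matched.append(pattern)
    return matched


print(matching_patterns(["developer.*.com"], "developer.foo.com"))
# ['developer.*.com']
print(matching_patterns(["developer.*.com"], "a.com.developer.foo"))
# []
```

Taking the leftmost occurrence of each part is enough for the order check: if any in-order placement of the parts exists, the leftmost one also works. The second stage only runs for the few patterns that survive the first, which is why this tends to be cheap in practice.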
For large sets of patterns, Aho-Corasick can require a lot of memory to store the state machine it builds. Other, similar methods have been designed since to improve on it. For example, search for the paper titled “Advanced Algorithms for Fast and Scalable Deep Packet Inspection” by Kumar, Turner, and Williams.
I am aware of other methods of solving this, too, which are patented by Cisco Systems. If there is any chance your company would license these methods, or already has some kind of bulk cross-licensing agreement with Cisco, I’d be happy to tell you more about those.