I need to match input strings (URLs) against a large set (anywhere from 1k-250k) of string rules with simple wildcard support.
Requirements for wildcard support are as follows:
Wildcard (*) can only substitute a “part” of a URL. That is fragments of a domain, path, and parameters. For example, “*.part.part/*/part?part=part&part=*”. The only exception to this rule is in the path area where “/*” should match anything after the slash.
Examples:
- *.site.com/* — should match sub.site.com/home.html, sub2.site.com/path/home.html
- sub.site.*/path/* — should match sub.site.com/path/home.html, sub.site.net/path/home.html, but not sub.site.com/home.html
Additional requirements:
- Fast lookup (I realize “fast” is a relative term. Given the max 250k rules, still fall within < 1.5s if possible.)
- Work within the scope of a modern desktop (e.g. not a server implementation)
- Ability to return 0:n matches given a input string
- Matches will have rule data attached to them
What is the best system/algorithm for such as task? I will be developing the solution in C++ with the rules themselves stored in a SQLite database.
If I’m not mistaken, you can take string rule and break it up into domain, path, and query pieces, just like it’s a URL. Then you can apply a standard wildcard matching algorithm with each of those pieces against the corresponding pieces from the URLs you want to test against. If all of the pieces match, the rule is a match.
Example
Rule: *.site.com/* domain => *.site.com path => /* query => [empty] URL: sub.site.com/path/home.html domain => sub.site.com path => /path/home.html query => [empty] Matching process: domain => *.site.com matches sub.site.com? YES path => /* matches /path/home.html? YES query => [empty] matches [empty] YES Result: MATCHAs you are storing the rules in a database I would store them already broken into those three pieces. And if you want uber-speed you could convert the
*‘s to%‘s and then use the database’s nativeLIKEoperation to do the matching for you. Then you’d just have a query likewhere
@urlDomain,@urlPath, and@urlQueryare variables in a prepared statement. The query would return the rules that match a URL, or an empty result set if nothing matches.