I want to make a system for URL matching. It will work in this way:
The database will contains many patterns. and some metadata of the pattern like this:
pattern1, keyword
pattern2, keyword
...
...
I have a input URL. like htttp://example.com/blabla/111/2222/detail.htm
The system will get the input and output the keyword of the most matched pattern for the input URL. There will be more than 20,000 requests per second.
The thing we need to design is the pattern and the database model. I’ve spend over 2 weeks in this system.
I’m thinking about match the URL in a tree.
All the nodes in the tree are able to do 2 kinds of output: Which node should continue matching the URL, or the node know which keyword should be applied to the URL.
Each node will be connected with a callback(a script stored in db). So different node will have different behavior.
But the thing we have is tons of patterns. I think I need to have a facility to convert patterns into thos “nodes”. Or at least can build a tree with existing nodes with the patterns in the db.
I’m still thinking about the tree generating. But there should be some better way.
Any ideas will be very helpful. Thank you!!!
You need one of the industrial-strength string matching algorithms: http://en.wikipedia.org/wiki/String_searching_algorithm. I don’t think a database-backed approach will work so well because it sounds like you need pattern matching, not exact prefix matching.
But if you are using prefix matching (the longest match from the beginning), then you can use a prefix trie, a trie. If I were you, I would use the database as a persistent store, but keep my matching trie in memory.