A friend of mine was asked this problem in an interview. I would like to discuss this problem here
What can be the efficient implementation for this problem ?
A simple idea which comes to me is normal memqueue , using Memcache machines to scale several requests, with a consumer job running which will write things from memcache to DB.
and later on for the second part we can just run a sql query to find list of matching subscribers .
PROBLEM:-
Events get published to this system. Each event can be thought of as containing a fixed number (N) of string columns called C1, C2, … CN. Each event can thus be passed around as an array of Strings (C1 being the 0th element in the array, C2 the 1st and so on).
There are M subscribers – S1, … SM
Each subscriber registers a predicate that specifies what subset of the events it’s interested in. Each predicate can contain:
Equality clause on columns, for example: (C1 == “US”)
Conjunctions of such clauses, example:
(C1 == “IN”) && (C2 == “home.php”)
(C1 == “IN”) && (C2 == “search.php”) && (C3 == “nytimes.com”)
(In the above examples, C1 stands for the country code of an event and C2 stands for the web page of the site and C3 the referrer code.)
ie. – each predicate is a conjunction of some number of equality conditions. Note that the predicate does not necessarily have an equality clause for ALL columns (ie. – a predicate may not care about the value of some or all columns). (In the examples above: #a does not care about the columns C3, … CN).
We have to design and code a Dispatcher that can match incoming events to registered subscribers. The incoming event rate is in millions per second. The number of subscribers is in thousands. So this dispatcher has to be very efficient. In plain words:
When the system boots, all the subscribers register their predicates to the dispatcher
After this events start coming to the dispatcher
For each event, the dispatcher has to emit the id of the matching subscribers.
In terms of an interface specification, the following can be roughly spelt out (in Java):
Class Dispatcher {
public Dispatcher(int N /* number of columns in each event – fixed up front */);
public void registerSubscriber( String subscriberId /* assume no conflicts */,
String predicate /* predicate for this subscriberid */);
public List<String> findMatchingIds(String[] event /* assume each event has N Strings */);
}
Ie.: the dispatcher is constructed, then a bunch of registerSubscriber calls are made. After this we continuously invoke the method findMatchingIds() and the goal of this exercise is to make this function as efficient as possible.
As Hanno Binder implied, the problem is clearly set up to allow pre-processing the subscriptions to obtain an efficient lookup structure. Hanno says the lookup should be a map
When an event arrives, just look up all the applicable sets and find their intersection. A lookup failure returns the empty set. I’m only recapping Hanno’s fine answer to point out that a hash table is O(1) and perhaps faster in this application than a tree. On the other hand, intersecting trees can be faster, O(S + log N) where S is the intersection size. So it depends on the nature of the sets.
Alternative
Here is my alternative lookup structure, again created only once during preprocessing. Begin by compiling a map
There is also a distinguished token
0that stands for “don’t care.”Now every predicate can be thought of as a regular expression-like pattern with N tokens, either representing a specific event string key or “don’t care.”
We can now build a decision tree in advance. You can also think of this tree is a Deterministic Finite Automaton (DFA) for recognizing the patterns. Edges are labeled with tokens, including “don’t care”. A don’t care edge is taken if no other edge matches. Accepting states contain the respective subscriber set.
Processing an event starts with converting the keys to a token pattern. If this fails due to a missing map entry, there are no subscribers. Otherwise feed the pattern to the DFA. If the DFA consumes the pattern without crashing, the final state contains the subscriber set. Return this.
For the example, we would have the map:
For N=4, the DFA would look like this:
Note that since there are no subscribers who don’t care about e.g. C1, the starting state doesn’t have a don’t care transition. Any event without “IN” in C1 will cause a crash, and the null set will be properly returned.
With only thousands of subscribers, the size of this DFA ought to be reasonable.
Processing time here is of course O(N) and could be very fast in practice. For real speed, the preprocessing could generate and compile a nest of C switch statements. In this fashion you might actually get millions of events per second with a small number of processors.
You might even be able to coax a standard tool like the flex scanner generator to do most of the work for you.