A user visits my website at time t and may or may not click on a particular link I care about. If they do, I record the fact that they clicked the link, along with the time elapsed between t and the click; call this d.
I need an algorithm that allows me to create a class like this:
class ClickProbabilityEstimate {
    public void reportImpression(long id);
    public void reportClick(long id);
    public double estimateClickProbability(long id);
}
Every impression gets a unique id, and this is used when reporting a click to indicate which impression the click belongs to.
I need an algorithm that returns the probability that an impression will still receive a click, given how much time has passed since the impression was reported and how long previous clicks took to arrive. Clearly one would expect this probability to decrease over time while there is still no click.
If necessary, we can set an upper bound beyond which we consider the click probability to be 0 (e.g. if it's been an hour since the impression occurred, we can be pretty sure there won't be a click).
The algorithm should be both space and time efficient, and hopefully make as few assumptions as possible, while being elegant. Ease of implementation would also be nice. Any ideas?
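Assuming you keep data on past impressions and clicks, it’s easy: let’s say that you have an impression, and a time d’ has passed since that impression. You can divide your data into three groups:
(1) impressions that received a click after less than d’
(2) impressions that received a click after more than d’
(3) impressions that never received a click
Clearly the current impression is not in group (1), so eliminate that. You want the probability it is in group (2), which is then
$$P = \frac{N_2}{N_2 + N_3}$$
where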
$N_2$ is the number of impressions in group 2, and similarly for $N_3$.
As far as actual implementation, my first thought would be to keep an ordered list of the times d for past impressions which did receive clicks, along with a count of the number of impressions which never received a click, and just do a binary search for d’ in that list. The position you find will give you $N_1$, and then $N_2$ is the length of the list minus $N_1$.
If you don’t need perfect granularity, you can store the past times as a histogram instead, i.e. a list that contains, in each element list[n], the number of impressions that received a click after at least n but less than n+1 minutes (or seconds, or whatever time interval you like). In that case you’d probably want to keep the total number of clicks as a separate variable so you can easily compute $N_2$.
(By the way, I just made this up; I don’t know if there are standard algorithms for this sort of thing that may be better.)
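For what it's worth, here is a rough Java sketch of the histogram variant, just to make the counting concrete. Several things in it are my own assumptions rather than part of the question: the constructor parameters (bucketMillis, bucketCount), the injected clock (a LongSupplier returning milliseconds, so the class can be exercised without waiting), and the in-memory map of pending impressions. It also treats every impression that has not been clicked so far as belonging to group (3), and it ignores clicks that arrive after the last bucket, which plays the role of the upper bound mentioned in the question.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

class ClickProbabilityEstimate {
    private final long bucketMillis;       // width of one histogram bucket (assumed parameter)
    private final long[] clicksPerBucket;  // [n] = clicks whose delay fell into bucket n
    private final LongSupplier clock;      // hypothetical injected time source, in milliseconds
    private final Map<Long, Long> pending = new HashMap<>(); // impression id -> impression time
    private long totalImpressions = 0;
    private long totalClicks = 0;          // kept separately so N2 is cheap to derive

    public ClickProbabilityEstimate(long bucketMillis, int bucketCount, LongSupplier clock) {
        this.bucketMillis = bucketMillis;
        this.clicksPerBucket = new long[bucketCount];
        this.clock = clock;
    }

    public void reportImpression(long id) {
        pending.put(id, clock.getAsLong());
        totalImpressions++;
    }

    public void reportClick(long id) {
        Long start = pending.remove(id);
        if (start == null) return;                      // unknown or already-clicked impression
        long bucket = (clock.getAsLong() - start) / bucketMillis;
        if (bucket >= clicksPerBucket.length) return;   // past the upper bound: counted as "no click"
        clicksPerBucket[(int) bucket]++;
        totalClicks++;
    }

    public double estimateClickProbability(long id) {
        Long start = pending.get(id);
        if (start == null) return 0.0;                  // nothing known about this impression
        long elapsed = (clock.getAsLong() - start) / bucketMillis;
        if (elapsed >= clicksPerBucket.length) return 0.0; // beyond the upper bound

        // N1: past clicks that arrived in a strictly earlier bucket than the elapsed time.
        long n1 = 0;
        for (int i = 0; i < elapsed; i++) n1 += clicksPerBucket[i];
        long n2 = totalClicks - n1;                     // clicks that took at least this long
        // N3: impressions without a click so far, excluding the one being asked about.
        long n3 = totalImpressions - totalClicks - 1;
        return (n2 + n3 == 0) ? 0.0 : (double) n2 / (n2 + n3);
    }
}
```

Each estimate costs one pass over at most bucketCount counters, and the histogram itself is fixed-size. The pending map is the part that grows, so in practice you would probably want to drop entries once they are older than the upper bound.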