I created a heuristic (an ANN, but that’s not important) to estimate the probabilities of an event (the results of sports games, but that’s not important either). Given some inputs, this heuristic tells me the probabilities of the event. Something like: given these inputs, team B has a 65% chance to win.
I have a large set of input data for which I know the result (games previously played). Which formula/metric could I use to quantify the accuracy of my estimator?
The problem I see is this: if the estimator says the event has a probability of 20% and the event actually does occur, I have no way to tell whether my estimator is right or wrong. Maybe it’s wrong and the event was more likely than that. Maybe it’s right: the event had about a 20% chance of occurring and did occur. Maybe it’s wrong the other way: the event had a really low chance of occurring, say 1 in 1000, but happened to occur this time.
Fortunately I have lots of these actual test data, so there is probably a way to use them to qualify my heuristic.
Anybody got an idea?
There are a number of measurements that you could use to quantify the performance of a binary classifier.
Do you care whether your estimator (an ANN, for example) outputs a calibrated probability?
If not, i.e. all that matters is rank ordering, then the area under the ROC curve (AUROC) is a pretty good summary of the classifier’s performance. Others are the “KS” statistic and lift. There are many in use, and they emphasize different facets of performance.
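To make the rank-ordering idea concrete, here is a minimal sketch of AUROC in plain Python. It uses the pairwise interpretation: the fraction of (event occurred, event did not occur) pairs in which the occurred case received the higher predicted probability, with ties counted as half. The function name `auroc` and the toy data are made up for illustration, and the quadratic loop is only meant for small samples.

```python
def auroc(probs, outcomes):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative outcome")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie
    return wins / (len(pos) * len(neg))


# Hypothetical predictions and results (1 = the event occurred).
probs = [0.9, 0.8, 0.65, 0.4, 0.3, 0.2]
outcomes = [1, 1, 0, 1, 0, 0]
print(auroc(probs, outcomes))  # 8 of the 9 pairs are ordered correctly

# Halving every prediction keeps the ordering, so AUROC does not change,
# even though the probabilities themselves are now badly miscalibrated.
print(auroc([p / 2 for p in probs], outcomes))
```

The second call is the reason AUROC alone cannot tell you whether "65%" really means 65%: it only looks at the ordering of the predictions.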
If you care about calibrated probabilities, then the most common metrics are the “cross entropy” (also known as log loss: the negative Bernoulli log-likelihood, which is the quantity minimized when logistic regression is fit by maximum likelihood) and the “Brier score”. The Brier score is none other than the mean squared error between the continuous predicted probabilities and the binary actual outcomes.
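Both metrics are a few lines each. The sketch below is plain Python with made-up function names and toy data; the `eps` clipping in `log_loss` is an assumption added here so that a prediction of exactly 0 or 1 does not produce an infinite penalty. Lower is better for both.

```python
import math


def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def log_loss(probs, outcomes, eps=1e-15):
    """Average negative Bernoulli log-likelihood (cross entropy), in nats."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


# Hypothetical predictions and results (1 = the event occurred).
probs = [0.65, 0.2, 0.9, 0.5]
outcomes = [1, 1, 1, 0]
print(brier_score(probs, outcomes), log_loss(probs, outcomes))

# A useful reference point: always predicting 50% scores 0.25 (Brier)
# and ln 2 (log loss), whatever the outcomes are.
print(brier_score([0.5] * 4, outcomes), log_loss([0.5] * 4, outcomes))
```

A single 20% prediction that comes true tells you little, as the question notes, but averaged over many games these scores do reward an estimator whose stated probabilities match the observed frequencies.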
Which is the right thing to use depends on the ultimate application of the classifier. For example, your classifier may estimate probability of blowouts really well, but be substandard on close outcomes.
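One way to see where a classifier is good and where it is not is to bin the predictions and compare, bin by bin, the average predicted probability with the frequency actually observed. This is a rough sketch only: the function name `reliability_table`, the equal-width bins, and the toy data are all assumptions made for illustration. Bins near 0 or 1 correspond to predicted blowouts, bins near 0.5 to games predicted to be close.

```python
def reliability_table(probs, outcomes, n_bins=5):
    """Per-bin rows: (bin_low, bin_high, count, mean_predicted, observed_rate)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    rows = []
    for i, members in enumerate(bins):
        if not members:
            continue  # skip bins that received no predictions
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        rows.append((i / n_bins, (i + 1) / n_bins, len(members), mean_pred, observed))
    return rows


# Hypothetical predictions and results (1 = the event occurred).
probs = [0.1, 0.15, 0.5, 0.55, 0.9, 0.95]
outcomes = [0, 0, 1, 0, 1, 1]
for low, high, count, mean_pred, observed in reliability_table(probs, outcomes):
    print(f"[{low:.1f}, {high:.1f}): n={count} predicted={mean_pred:.3f} observed={observed:.3f}")
```

With real data you would want enough games per bin for the observed rate to be meaningful; a large gap between the two columns in a particular bin points to the range of predictions where the estimator is off.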
Usually, the true metric that you’re trying to optimize is “dollars made”. That’s often hard to represent mathematically, but starting from that is your best shot at coming up with an appropriate and computationally tractable metric.