I’m evaluating a number of different algorithms whose job is to predict the probability of an event occurring.
I am testing the algorithms on large-ish datasets. I measure their effectiveness using “Root Mean Squared Error” (RMSE), which is the square root of the mean of the squared errors. The error is the difference between the predicted probability (a floating-point value between 0 and 1) and the actual outcome (either 0.0 or 1.0).
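For concreteness, here is a minimal sketch of that metric in Python (the function name and the use of numpy are my own choices for illustration):

```python
import numpy as np

def rmse(predicted, actual):
    """Root Mean Squared Error: the square root of the mean of the squared errors."""
    predicted = np.asarray(predicted, dtype=float)  # predicted probabilities in [0, 1]
    actual = np.asarray(actual, dtype=float)        # observed outcomes, 0.0 or 1.0
    return np.sqrt(np.mean((predicted - actual) ** 2))
```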
So I know the RMSE, and also the number of samples that the algorithm was tested on.
The problem is that sometimes the RMSE values are quite close to each other, and I need a way to determine whether the difference between them is just chance or whether it reflects a real difference in performance.
Ideally, for a given pair of RMSE values, I’d like to know the probability that one algorithm is really better than the other, so that I can apply a significance threshold to that probability.
You are entering into a vast and contentious area of not only computation but philosophy. Significance tests and model selection are subjects of intense disagreement between the Bayesians and the Frequentists. Triston’s comment about splitting the dataset into training and verification sets would not please a Bayesian.
May I suggest that RMSE is not an appropriate score for probabilities? If the samples are independent, the proper score is the sum of the logarithms of the probabilities assigned to the actual outcomes. (If they are not independent, you have a mess on your hands.) What I am describing is scoring a “plug-in” model. Proper Bayesian modeling requires integrating over the model parameters, which is computationally extremely difficult. A Bayesian way to regularize a plug-in model is to add a penalty to the score for unlikely (large) model parameters. That has been called “weight decay.”
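To make that concrete, here is a hedged sketch of the plug-in log score and a weight-decay-style penalty in Python (the function names, the clipping epsilon, and the penalty strength lam are my own illustrative choices):

```python
import numpy as np

def log_score(predicted, actual, eps=1e-12):
    """Sum of the log probabilities assigned to the actual outcomes.

    predicted: predicted probability of the event, between 0 and 1
    actual:    observed outcome, 0.0 or 1.0
    Higher (less negative) is better.
    """
    p = np.clip(np.asarray(predicted, dtype=float), eps, 1 - eps)  # avoid log(0)
    y = np.asarray(actual, dtype=float)
    # The probability assigned to what actually happened is p when y == 1
    # and (1 - p) when y == 0.
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def penalized_log_score(predicted, actual, weights, lam=0.01):
    """Log score minus a "weight decay" penalty on large model parameters."""
    w = np.asarray(weights, dtype=float)
    return log_score(predicted, actual) - lam * np.sum(w ** 2)
```

Comparing two algorithms then amounts to comparing their log scores on the same test set, rather than their RMSE values.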
I got started on my path of discovery reading Neural Networks for Pattern Recognition by Christopher Bishop. I used it, along with Practical Optimization by Gill et al., to write software that has worked very well for me.