I have a very basic question about calculating RMSE in an NB classification scenario. My training data X has some 1000-odd reviews with ratings in [1,5] which are the class labels Y.
So what I am doing is something like this:
model = nb_classifier_train(trainingX,Y)
Yhat = nb_classifier_test(model,testingX)
My testing data has some 400-odd reviews with missing ratings (whose labels/ratings I need to predict). Now, to calculate RMSE:
RMSE = sqrt(mean((Y - Yhat).^2))
What is the Y in this scenario? I understand RMSE is calculated using difference between predicted values and actual values. What are the actual values here? Or is there something missing?
Y in this case is the vector of labels for your training data, so the RMSE you’re calculating does not make much sense: you are making predictions on the test examples and comparing them against the training labels. In fact, there is no reason the Y and Yhat vectors would even be the same length (here, roughly 1000 versus 400 entries). Instead, replace Y with the true labels of your test examples. If you don’t have test labels, you simply have no way of calculating your test error.
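To make the point concrete, here is a minimal sketch in plain Python (rather than the MATLAB-style notation of the question). The helper `rmse` and the names `y_heldout` and `yhat_heldout` are hypothetical. The sketch assumes you have true ratings for whatever examples you predict on, for instance because you set aside part of your labeled reviews, which is an assumption and not something stated in the question. The numbers are made up purely for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between true and predicted ratings."""
    # Both vectors must describe the same examples, so their lengths must match.
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("need at least one labeled example")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical true ratings for four evaluated reviews, and the
# classifier's predictions for those same four reviews.
y_heldout = [5, 3, 4, 1]
yhat_heldout = [4, 3, 5, 1]

print(rmse(y_heldout, yhat_heldout))
```

Passing the ~1000 training labels together with the ~400 test predictions would raise the `ValueError` here, which is the same mismatch described above.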