There are N different classes that can be observed in my problem, and my task is to detect which ones occur at each time t (of T frames). I created two binary matrices, actualLabels and predictedLabels, of size NxT. I inspected the data and filled in actualLabels by hand: actualLabels(n,t) is 1 if the instance at time t involves the nth class, and 0 otherwise. This serves as my ground truth. Then I run my algorithm on the data to predict the observed classes; the labels are found automatically and stored in predictedLabels.
My question is: how can I compute a success value from these matrices? Is there a popular way to do this?
Example case: Let there be 4 classes and T=5. Let the data be
actualLabels =
0 0 0 0 1
1 1 0 1 0
0 1 0 0 1
0 0 0 0 1
predictedLabels =
0 0 0 0 1
0 0 1 1 0
0 1 0 0 0
0 1 0 0 0
It does not seem possible to compute a conventional confusion matrix for a multi-label assignment like this. Instead, I computed a distance for each pair of actual and predicted label vectors. Since I am comparing binary vectors, the Hamming distance seems like a good fit (similar to edit distance). The problem now is that I can report the distances between predicted and actual label vectors, but not a success percentage.
A confusion matrix conveys a lot of information. I would like a similar table that shows where mistakes occur most often, the overall success, etc.
Details: I have some WAV data and I want to do polyphonic pitch tracking. At each time bin, any number of notes can be played together, and these notes form the labels I want to predict.
Note: There are some metrics for multi-label classification on Wikipedia. I would be happy to learn about any other metric or plot.
To measure success, you first need to define it. Choose an error tolerance you are willing to accept (perhaps zero), then count the frames whose Hamming distance between the predicted and actual label vectors does not exceed it. Dividing that count by the number of frames gives your success percentage.
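As a rough illustration of this counting, here is a minimal Python sketch using the example matrices from the question. The function names `frame_hamming_distances` and `success_rate` are made up for this example, and the sketch assumes one label vector per frame (one column of the NxT matrix).

```python
# Rows are classes, columns are time frames (N x T), as in the question.
actual_labels = [
    [0, 0, 0, 0, 1],
    [1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
predicted_labels = [
    [0, 0, 0, 0, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
]


def frame_hamming_distances(actual, predicted):
    """Hamming distance between the actual and predicted label vector of each frame."""
    # zip(*matrix) iterates over columns, i.e. one label vector per frame.
    return [
        sum(a != p for a, p in zip(actual_col, predicted_col))
        for actual_col, predicted_col in zip(zip(*actual), zip(*predicted))
    ]


def success_rate(distances, tolerance=0):
    """Fraction of frames whose Hamming distance does not exceed the tolerance."""
    return sum(d <= tolerance for d in distances) / len(distances)


distances = frame_hamming_distances(actual_labels, predicted_labels)
exact_match_rate = success_rate(distances, tolerance=0)  # frames predicted perfectly
within_one_rate = success_rate(distances, tolerance=1)   # at most one wrong label
print(distances, exact_match_rate, within_one_rate)
```

For the example data, the per-frame distances are 1, 2, 1, 0, 2, so one frame in five is an exact match and three in five are within one label of the truth.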
If your label matrices are sparse (mostly zeros), this can be a misleading measure, since a model that always predicts the zero matrix will score well. In that case, look at precision and recall. There is a natural tradeoff between the two, so it is usually not possible to maximize both at once. To combine them into a single metric, consider the F-score. If your labels are not sparse, the simple accuracy percentage is probably best.
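To make precision, recall, and the F-score concrete, here is a hedged Python sketch on the question's example. It counts true positives, false positives, and false negatives per class, which also gives a small per-class table of where the mistakes occur, and then pools the counts over all classes (micro-averaging). The helper names are invented for this sketch, and returning 0.0 when a ratio is undefined is a convention assumed here, not the only reasonable choice.

```python
# Rows are classes, columns are time frames (N x T), as in the question.
actual_labels = [
    [0, 0, 0, 0, 1],
    [1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
predicted_labels = [
    [0, 0, 0, 0, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
]


def per_class_counts(actual, predicted):
    """Return one (true positives, false positives, false negatives) tuple per class."""
    counts = []
    for actual_row, predicted_row in zip(actual, predicted):
        pairs = list(zip(actual_row, predicted_row))
        tp = sum(a == 1 and p == 1 for a, p in pairs)
        fp = sum(a == 0 and p == 1 for a, p in pairs)
        fn = sum(a == 1 and p == 0 for a, p in pairs)
        counts.append((tp, fp, fn))
    return counts


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1; 0.0 is returned where a ratio is undefined (assumed convention)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
    return precision, recall, f1


counts = per_class_counts(actual_labels, predicted_labels)

# Per-class table: shows which classes are missed or over-predicted.
for class_index, (tp, fp, fn) in enumerate(counts):
    print(class_index, tp, fp, fn, precision_recall_f1(tp, fp, fn))

# Micro-average: pool the counts of all classes before computing the ratios.
total_tp = sum(tp for tp, _, _ in counts)
total_fp = sum(fp for _, fp, _ in counts)
total_fn = sum(fn for _, _, fn in counts)
micro = precision_recall_f1(total_tp, total_fp, total_fn)
print(micro)
```

On the example, the pooled counts are 3 true positives, 2 false positives, and 4 false negatives, giving a precision of 0.6, a recall of about 0.43, and an F1 of 0.5.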
Finally, if you are measuring accuracy in order to choose among several candidate models (a step called validation), beware of reusing your training data for this step. Instead, partition your data into training data and cross-validation data. The trouble is that your models are already biased toward the data they were trained on; doing well on that data does not mean they will generalize to what they might see in a real application. See the Wikipedia entry on cross-validation for more details.
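A simple hold-out partition could be sketched as follows. The function `split_indices` and its parameters are hypothetical, and the sketch assumes that the items being split (for instance whole recordings rather than individual frames) can be treated as independent.

```python
import random


def split_indices(n_items, validation_fraction=0.2, seed=0):
    """Shuffle item indices and split them into (training, validation) lists."""
    indices = list(range(n_items))
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(indices)
    n_validation = int(round(n_items * validation_fraction))
    return indices[n_validation:], indices[:n_validation]


# Example: 10 items, 20% of them held out for validation.
train_idx, validation_idx = split_indices(10, validation_fraction=0.2, seed=0)
print(train_idx, validation_idx)
```

The models are fitted on the items in `train_idx` only, and the success measures are then computed on the items in `validation_idx`.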