Now I am about to report the results from Named Entity Recognition. One thing that I find a bit confusing is that my understanding of precision and recall was that one simply sums up true positives, true negatives, false positives and false negatives over all classes.
But this seems implausible now that I think of it, as each misclassification would simultaneously give rise to one false positive and one false negative (e.g. a token that should have been labelled as “A” but was labelled as “B” is a false negative for “A” and a false positive for “B”). Thus the number of false positives and false negatives over all classes would be the same, which means that precision is (always!) equal to recall. This simply can’t be true, so there is an error in my reasoning and I wonder where it is. It is certainly something quite obvious and straightforward, but it escapes me right now.
The way precision and recall are typically computed (this is what I use in my papers) is to measure entities against each other. Suppose the ground truth has the following (without any differentiation as to what type of entities they are):

    [Microsoft Corp.] CEO [Steve Ballmer] announced the release of [Windows 7] today

This has 3 entities.
Supposing your actual extraction has the following

    [Microsoft Corp.] [CEO] [Steve] Ballmer announced the release of Windows 7 [today]

You have an exact match for “Microsoft Corp.”, false positives for “CEO” and “today”, a false negative for “Windows 7”, and a substring match for “Steve”.

We compute precision and recall by first defining matching criteria. For example, do they have to be an exact match? Is it a match if they overlap at all? Do entity types matter? Typically we want to provide precision and recall for several of these criteria.
Exact match: True Positives = 1 (“Microsoft Corp.”, the only exact match), False Positives = 3 (“CEO”, “today”, and “Steve”, which isn’t an exact match), False Negatives = 2 (“Steve Ballmer” and “Windows 7”)

Any Overlap OK: True Positives = 2 (“Microsoft Corp.”, and “Steve”, which overlaps “Steve Ballmer”), False Positives = 2 (“CEO” and “today”), False Negatives = 1 (“Windows 7”)

The reader is then left to infer that the “real performance” (the precision and recall that an unbiased human checker would give when allowed to use human judgement to decide which overlap discrepancies are significant, and which are not) is somewhere between the two.
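
For illustration, here is a minimal Python sketch of how the two sets of counts could be reproduced. It assumes entities are represented as half-open `(start, end)` token spans over the example sentence; the names `exact`, `overlaps` and `count_matches` are hypothetical helpers, not part of any library. It is also a simplification: if several predicted spans overlapped the same true entity, each would count as a true positive.

```python
# Token indices for the example sentence:
# 0 Microsoft, 1 Corp., 2 CEO, 3 Steve, 4 Ballmer, 5 announced,
# 6 the, 7 release, 8 of, 9 Windows, 10 7, 11 today

# Entities as half-open (start, end) token spans (an assumed representation).
gold = [(0, 2), (3, 5), (9, 11)]                # Microsoft Corp., Steve Ballmer, Windows 7
predicted = [(0, 2), (2, 3), (3, 4), (11, 12)]  # Microsoft Corp., CEO, Steve, today


def exact(p, g):
    """Match only if both boundaries agree."""
    return p == g


def overlaps(p, g):
    """Match if the two spans share at least one token."""
    return p[0] < g[1] and g[0] < p[1]


def count_matches(gold, predicted, match):
    """Return (true positives, false positives, false negatives)."""
    # A predicted entity is a true positive if it matches some gold entity.
    tp = sum(1 for p in predicted if any(match(p, g) for g in gold))
    fp = len(predicted) - tp
    # A gold entity is a false negative if no predicted entity matches it.
    fn = sum(1 for g in gold if not any(match(p, g) for p in predicted))
    return tp, fp, fn


exact_counts = count_matches(gold, predicted, exact)       # (1, 3, 2)
overlap_counts = count_matches(gold, predicted, overlaps)  # (2, 2, 1)

for name, (tp, fp, fn) in [("exact", exact_counts), ("overlap", overlap_counts)]:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print(f"{name}: precision={precision:.3f} recall={recall:.3f}")
```

With these counts, precision and recall differ under both criteria (0.250 vs 0.333 for exact match, 0.500 vs 0.667 for overlap), because false positives and false negatives are counted per entity rather than per token.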
It’s also often useful to report the F1 measure, which is the harmonic mean of precision and recall, and which gives some idea of “performance” when you have to trade off precision against recall.
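
As a rough sketch of how F1 follows from the counts in the example, the snippet below plugs in the true positive, false positive and false negative numbers given for the two matching criteria. The function name `precision_recall_f1` is a hypothetical helper, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the example: (TP, FP, FN) for each matching criterion.
criteria = {"exact match": (1, 3, 2), "any overlap": (2, 2, 1)}

for name, (tp, fp, fn) in criteria.items():
    p, r, f1 = precision_recall_f1(tp, fp, fn)
    print(f"{name}: P={p:.3f} R={r:.3f} F1={f1:.3f}")
# exact match: P=0.250 R=0.333 F1=0.286
# any overlap: P=0.500 R=0.667 F1=0.571
```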