I have a few algorithms that extract and rank keywords [both terms and bigrams] from a paragraph [most are based on the tf-idf model].
I am looking for an experiment to evaluate these algorithms. This experiment should give a grade to each algorithm, indicating “how good was it” [on the evaluation set, of course].
I am looking for an automatic / semi-automatic method to evaluate each algorithm’s results, and an automatic / semi-automatic method to create the evaluation set.
Note: These experiments will be ran off-line, so efficiency is not an issue.
The classic way to do this would be to define a set of key words you want the algorithms to find per paragraph, then check how well the algorithms do with respect to this set, e.g. (generated_correct – generated_not_correct)/total_generated (see update, this is nonsense). This is automatic once you have defined this ground truth. I guess constructing that is what you want to automate as well when you talk about constructing the evaluation set? That’s a bit more tricky.
Generally, if there was a way to generate key words automatically that’s a good way to use as a ground truth – you should use that as your algorithm ;). Sounds cheeky, but it’s a common problem. When you evaluate one algorithm using the output of another algorithm, something’s probably going wrong (unless you specifically want to benchmark against that algorithm).
So you might start harvesting key words from common sources. For example:
Download scientific papers that have a keyword section. Check if those keywords actually appear in the text, if they do, take the section of text including the keywords, use the keyword section as ground truth.
Get blog posts, check if the terms in the heading appear in the text, then use the words in the title (always minus stop words of course) as ground truth
…
You get the idea. Unless you want to employ people to manually generate keywords, I guess you’ll have to make do with something like the above.
Update
The evaluation function mentioned above is stupid. It does not incorporate how many of the available key words have been found. Instead, the way to judge a ranked list of relevant and irrelevant results is to use precision and recall. Precision rewards the absence of irrelevant results, Recall rewards the presence of relevant results. This again gives you two measures. In order to combine these two into a single measure, either use the F-measure, which combines those two measures into a single measure, with an optional weighting. Alternatively, use Precision@X, where X is the number of results you want to consider. Precision@X interestingly is equivalent to Recall@X. However, you need a sensible X here, ie if you have less than X keywords in some cases, those results will be punished for never providing an Xth keyword. In the literature on tag recommendation for example, which is very similar to your case, F-measure and P@5 are often used.
http://en.wikipedia.org/wiki/F1_score
http://en.wikipedia.org/wiki/Precision_and_recall