Does anyone know of a simple way to use Python and NLTK to find the article that most closely matches a search query? For example, I would like to take 10 articles from Wikipedia, compute a frequency distribution for each of them (along with another method of classification, if you have any recommendations), and, based on a search query, return the articles I am most likely referring to.
Any ideas? I would like a better method than frequency distributions, but I thought I would start there.
tf-idf (also written TFxIDF, tfidf, or tf/idf) is pretty much the standard solution, and it is the term weighting used in Rocchio’s algorithm. Instead of the bare frequency, you also count how many documents in the whole set contain each term, then express the term’s weight as its frequency in the document multiplied by the log of the total number of documents divided by that document count (the inverse document frequency, or IDF). That way, you mostly don’t need a stop word list, because the IDF of a word that appears in nearly every document will make its weight nearly zero.
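To make that concrete, here is a minimal sketch of tf-idf ranking using only the standard library. The three tiny "articles", the regex tokenizer, and the function names (`tokenize`, `build_index`, `search`) are all made up for illustration; with NLTK installed you could swap `nltk.word_tokenize` in for the tokenizer. Ranking by cosine similarity between the query vector and each document vector is one common choice, not the only one.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Crude tokenizer (assumption): lowercase, keep runs of letters/digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Return (tf-idf vector per document, idf per term) for {title: text}."""
    counts = {title: Counter(tokenize(text)) for title, text in docs.items()}
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for c in counts.values():
        df.update(c.keys())
    idf = {term: math.log(n_docs / d) for term, d in df.items()}
    vectors = {}
    for title, c in counts.items():
        total = sum(c.values())
        vectors[title] = {t: (n / total) * idf[t] for t, n in c.items()}
    return vectors, idf


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, vectors, idf):
    """Return (score, title) pairs, best match first."""
    # Query terms never seen in the collection carry no information here.
    q_counts = Counter(t for t in tokenize(query) if t in idf)
    total = sum(q_counts.values())
    if total == 0:
        return []
    q_vec = {t: (n / total) * idf[t] for t, n in q_counts.items()}
    scored = [(cosine(q_vec, vec), title) for title, vec in vectors.items()]
    return sorted(scored, reverse=True)


# Hypothetical stand-ins for the Wikipedia articles.
docs = {
    "Python": "Python is a programming language. The language is used for scripting.",
    "Snake": "The python is a large snake. The snake is found in Africa and Asia.",
    "Java": "Java is a programming language. The language is used for the enterprise.",
}

vectors, idf = build_index(docs)
for score, title in search("snake in africa", vectors, idf):
    print(f"{score:.3f}  {title}")
```

Note that words like "the" and "is" occur in all three documents, so their IDF is log(3/3) = 0 and they drop out of the scoring without any stop word list.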