I am wondering what the best method is for defining a dictionary to calculate the relevance of a specific website. Word dictionaries seem to be an important way of measuring relevance for new websites found via links (e.g. if a website is linked to but contains no words about soccer, it is probably irrelevant for my soccer crawler).
I came up with the following ideas, but all of them have major drawbacks:
- Write a dictionary by hand -> you might forget a lot of words and it is very time consuming
- Take the most important words from the first website as dictionary -> a lot of words would probably be missing
- Take the most important words on all websites as entries in the dictionary and weight them by relevance (e.g. a website with a relevance of only 0.4 would have less impact on the dictionary than a website with a relevance of 0.8) -> seems pretty complicated and could lead to unexpected results
The last method seems the best to me, but maybe there are better and more common methods?
I would recommend that you build a common-word dictionary from a list of known sites. Suppose you have 100 sites and you know that they’re all talking about soccer. You can build unigram and bigram (or n-gram) maps of their content and use them as a baseline against which you measure some type of “deviation” for every new site you observe. Note that you would have to remove the common stopwords first so that they do not dominate the counts; in English there are quite a few, here is a list: http://www.ranks.nl/resources/stopwords.html
An n-gram is a sequence of n consecutive words, and an n-gram map stores how often each sequence occurs. A unigram map uses a single word as the key and the number of occurrences of that word as the value. A bigram map uses two consecutive words as the key, and the same pattern extends to trigrams and larger n-grams.
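To make the maps concrete, here is a minimal sketch in Python. The tiny `STOPWORDS` set and the function names `tokenize` and `ngram_counts` are only placeholders I made up for illustration; a real crawler would use a full stopword list and a proper HTML-to-text step.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (an assumption); a real list is much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "on", "for"}

def tokenize(text):
    """Lowercase the text, split it into words, and drop stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def ngram_counts(tokens, n):
    """Map each n-gram (a tuple of n consecutive tokens) to its count."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)

text = "The striker scored a goal and the keeper missed the goal"
tokens = tokenize(text)
unigrams = ngram_counts(tokens, 1)  # keys look like ("goal",)
bigrams = ngram_counts(tokens, 2)   # keys look like ("scored", "goal")
```

Note that this sketch removes stopwords before pairing words, so a bigram such as ("scored", "goal") joins words that were not strictly adjacent in the original text. Whether that is what you want depends on your data.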
You can take the top n-grams from your known sites and compare them against the top n-grams of the site you’re currently evaluating. The more similar they are, the more likely it is that the site covers the same topic.
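One simple way to put a number on "similar" is the Jaccard overlap between the two top-k sets. That measure is my own choice for this sketch, not the only option (cosine similarity over the counts would also work), and the counts below are made-up values, not real data.

```python
from collections import Counter

def top_ngrams(counts, k):
    """Return the set of the k most frequent n-grams in a Counter."""
    return {gram for gram, _ in counts.most_common(k)}

def jaccard(a, b):
    """Jaccard similarity of two sets: shared items / all items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical unigram counts: the baseline from known soccer sites,
# and the counts from the site currently being evaluated.
baseline = Counter({"goal": 50, "striker": 30, "league": 25, "penalty": 20})
candidate = Counter({"goal": 8, "league": 5, "transfer": 4, "recipe": 1})

score = jaccard(top_ngrams(baseline, 3), top_ngrams(candidate, 3))
print(score)  # 0.5: "goal" and "league" are shared out of 4 distinct words
```

You would then pick a threshold on the score (by trial on sites you have already labeled) to decide whether the crawler should follow the new site.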