The Archive Base

The Archive Base Latest Questions

Editorial Team
Asked: May 22, 2026 (2026-05-22T00:59:00+00:00)

I am wondering what is the best method to define a dictionary to calculate

I am wondering what the best method is to define a dictionary for calculating the relevance of a specific website. Word dictionaries seem to be an important way of measuring the relevance of new websites found via links (e.g. if a website is linked to but does not contain a single word about soccer, it is probably irrelevant for my soccer crawler).

I came up with the following ideas, but all of them have major drawbacks:

  • Write a dictionary by hand -> you might forget a lot of words, and it is very time-consuming
  • Take the most important words from the first website as the dictionary -> a lot of words would probably be missing
  • Take the most important words from all websites as dictionary entries and weight them by relevance (e.g. a website with a relevance of only 0.4 would have less impact on the dictionary than a website with a relevance of 0.8) -> seems pretty complicated and could lead to unexpected results

The last method seems the best to me, but maybe there are better and more common methods?
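
To make the third idea concrete, here is a rough sketch of what I have in mind. The function name `weighted_dictionary`, the sample pages, and the choice to use relative word frequencies are my own assumptions for illustration, not an established method.

```python
from collections import Counter

def weighted_dictionary(pages):
    """Build a word -> weight map from (word_counts, relevance) pairs.

    relevance is assumed to be a score between 0 and 1 for each page.
    """
    weights = Counter()
    for word_counts, relevance in pages:
        total = sum(word_counts.values())
        if total == 0:
            continue  # skip empty pages
        for word, count in word_counts.items():
            # Use the relative frequency so long pages do not dominate,
            # then scale it by how relevant the page is.
            weights[word] += relevance * count / total
    return weights

# Hypothetical sample data: one fairly relevant page, one weakly relevant page.
pages = [
    (Counter({"goal": 3, "league": 1}), 0.8),
    (Counter({"goal": 1, "recipe": 3}), 0.4),
]
dictionary = weighted_dictionary(pages)
```

With this data, "goal" ends up with a higher weight than "recipe", because it is frequent on the page with the higher relevance.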

1 Answer

  1. Editorial Team
    Added an answer on May 22, 2026 at 12:59 am (2026-05-22T00:59:01+00:00)

    I would recommend building a common-word dictionary from a list of known sites. Suppose you have 100 sites that you know are all about soccer. You can build unigram and bigram (or general n-gram) maps of their content and use them as a baseline against which you measure some type of “deviation” for every new site you observe. Note that you have to remove the common stopwords first so that irrelevant words do not dominate the counts; English has quite a few, and a list is available here: http://www.ranks.nl/resources/stopwords.html
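
A minimal sketch of the baseline step, assuming the page text has already been extracted from the HTML. The names `STOPWORDS`, `content_words`, and `build_baseline` are hypothetical, the stopword set is a tiny stand-in for a full list, and only single words are counted here to keep it short.

```python
from collections import Counter
import re

# Tiny illustrative stopword set; a real list would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "on", "for"}

def content_words(text):
    """Lowercase the text, split it into words, and drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def build_baseline(known_pages):
    """Sum word counts over the text of pages known to be on-topic."""
    baseline = Counter()
    for text in known_pages:
        baseline.update(content_words(text))
    return baseline

# Hypothetical sample text standing in for the known soccer sites.
known_pages = [
    "The striker scored a goal in the final",
    "A late goal for the striker and the league is decided",
]
baseline = build_baseline(known_pages)
```

Here `baseline` holds counts such as "goal" and "striker", while "the" and "a" never enter the map.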

    An n-gram is a sequence of n consecutive words, and an n-gram map stores how often each one occurs. A unigram map uses the word as the key and the number of occurrences of that word as the value. Bigrams are usually constructed by combining two consecutive words and using the pair as the key, and so forth for trigrams and higher-order n-grams.
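
As an illustration of these maps, here is one way to count n-grams in Python. The helper names `tokenize` and `ngram_counts` and the sample sentence are assumptions made for this sketch; keys are tuples of words so the same function covers unigrams, bigrams, and longer n-grams.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens, keyed by a tuple of words."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = tokenize("The goalkeeper saved the penalty and the goalkeeper smiled")
unigrams = ngram_counts(tokens, 1)  # keys like ("goalkeeper",)
bigrams = ngram_counts(tokens, 2)   # keys like ("the", "goalkeeper")
```

In this sentence the unigram ("the",) occurs three times and the bigram ("the", "goalkeeper") occurs twice. Stopwords are left in here only to keep the example short.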

    You can take the top n-grams from your known sites and compare them against the top n-grams of the site you’re currently evaluating. The more similar they are, the more likely it is that the site covers the same topic.
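
The comparison can be sketched as below. I did not name a specific similarity measure above, so the Jaccard overlap of the two top-k sets used here is just one simple option, and `top_terms`, `overlap_score`, and the sample counts are hypothetical.

```python
from collections import Counter

def top_terms(counts, k):
    """Return the set of the k most frequent keys in a Counter."""
    return {term for term, _ in counts.most_common(k)}

def overlap_score(baseline_counts, candidate_counts, k=50):
    """Jaccard overlap between the two top-k term sets, from 0.0 to 1.0."""
    a = top_terms(baseline_counts, k)
    b = top_terms(candidate_counts, k)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical counts: a baseline from known soccer sites and two new pages.
baseline = Counter({"goal": 40, "striker": 25, "league": 20, "penalty": 10})
soccer_page = Counter({"goal": 5, "league": 3, "striker": 2, "coach": 1})
cooking_page = Counter({"recipe": 6, "oven": 4, "butter": 3, "goal": 1})

soccer_score = overlap_score(baseline, soccer_page, k=4)
cooking_score = overlap_score(baseline, cooking_page, k=4)
```

The soccer page shares three of its top four terms with the baseline and scores higher than the cooking page, which shares only one. A crawler could then follow links only from pages whose score is above some threshold that you tune on known examples.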

© 2021 The Archive Base. All Rights Reserved