The Archive Base

Editorial Team
Asked: May 25, 2026

I’m trying to create an algorithm that set some relevance to a webpage based

I’m trying to create an algorithm that assigns a relevance score to a webpage based on the keywords it finds on the page.

This is what I’m doing at the moment:

I define some words and a value for each: “movie” (10), “cinema” (6), “actor” (5) and “hollywood” (4). I then search certain parts of the page, giving each part a weight and multiplying it by the word’s weight.

Example: the word “movie” was found in the URL (1.5 * 10 = 15) and in the title (2.5 * 10 = 25), for a total of 40.
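
The scheme described above can be sketched roughly as follows. This only illustrates the arithmetic of the example: the word weights and the URL (1.5) and title (2.5) weights come from the question, while the function name `raw_score`, the dictionary layout and the sample page are assumptions made for illustration.

```python
# Word weights taken from the question.
WORD_WEIGHTS = {"movie": 10, "cinema": 6, "actor": 5, "hollywood": 4}

# Weights for parts of the page; only these two values are given in the question.
FIELD_WEIGHTS = {"url": 1.5, "title": 2.5}


def raw_score(fields):
    """Sum field_weight * word_weight for every keyword found in every field.

    `fields` maps a field name (e.g. "url") to its text. Each keyword counts
    at most once per field, using a case-insensitive substring match.
    """
    total = 0.0
    for name, text in fields.items():
        field_weight = FIELD_WEIGHTS.get(name, 0.0)
        lowered = text.lower()
        for word, word_weight in WORD_WEIGHTS.items():
            if word in lowered:
                total += field_weight * word_weight
    return total


# Hypothetical page used only to reproduce the example.
page = {
    "url": "http://example.com/movie/some-title",
    "title": "Some Title - Movie Review",
}
print(raw_score(page))  # 1.5 * 10 + 2.5 * 10 = 40.0
```

Because the total grows with every extra field and keyword, the result has no fixed upper bound, which is why values such as 244 or 66 are hard to interpret on their own.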

This is trash! It’s my first attempt, and it returns some relevant results, but I don’t think a relevance expressed as a value like 244, 66, 30 or 15 is useful.

I want something that falls inside a fixed range, from 0 to 1 or from 1 to 100.
What type of weighting for words can I use?

Besides that, are there ready-made algorithms that determine the relevance of an HTML page based on things like the URL, keywords, title, etc., excluding the main content?

EDIT 1: All of this can be rebuilt. The current weights are arbitrary; I want to use a consistent scale of weights rather than random numbers such as 10, 5 and 3.

Something like: low importance = 1, medium importance = 2, high importance = 4, deterministic importance = 8.

Title > Link Part of URL > Domain > Keywords
movie > cinema > actor > hollywood
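
One possible way to keep the score between 0 and 1 with a scale like the one in EDIT 1 is to divide the achieved score by the maximum score a page could reach. The sketch below assumes the two orderings above are mapped onto the 8/4/2/1 scale; that mapping, the field names and the function name `normalized_score` are illustrative assumptions, not an established algorithm.

```python
import re

# Assumed mapping of the two orderings onto the 8/4/2/1 scale from EDIT 1.
FIELD_WEIGHTS = {"title": 8, "link": 4, "domain": 2, "keywords": 1}
WORD_WEIGHTS = {"movie": 8, "cinema": 4, "actor": 2, "hollywood": 1}


def normalized_score(fields):
    """Return a relevance in [0, 1]: achieved weight divided by the maximum possible."""
    # Best case: every keyword appears in every field.
    max_score = sum(FIELD_WEIGHTS.values()) * sum(WORD_WEIGHTS.values())
    achieved = 0
    for name, text in fields.items():
        field_weight = FIELD_WEIGHTS.get(name, 0)
        # Whole-word matching on lowercase alphanumeric tokens.
        tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
        for word, word_weight in WORD_WEIGHTS.items():
            if word in tokens:
                achieved += field_weight * word_weight
    return achieved / max_score


# Hypothetical page metadata.
page = {
    "title": "The Movie Database",
    "link": "movie/1234",
    "domain": "cinema-news.example",
    "keywords": "movie, actor, hollywood",
}
print(round(normalized_score(page), 3))  # 115 / 225 -> 0.511
```

A page that matches every keyword in every field scores exactly 1, and multiplying the result by 100 gives a 0 to 100 scale instead.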

EDIT 2: At the moment, I want to analyze the page’s relevance for these words while excluding the body content of the page. The analysis will include the domain, the link part of the URL, the title, the keywords (and any other meta information I judge useful).

The reason for this is that the HTML content is dirty. I can find many occurrences of words like ‘movie’ in menus and advertisements while the main content of the page contains nothing relevant to the theme.

Another reason is that some pages have meta information indicating that the page is about a movie, while the main content does not. Example: a page that contains the plot of a film, telling the story, the characters, etc., but whose text never mentions anything indicating that it is about a movie; only the page’s meta information does.

Later, after running the relevance analysis on the HTML page, I will run a separate relevance analysis on the (filtered) content.


1 Answer

  1. Editorial Team
    Added an answer on May 25, 2026 at 10:35 am

    Are you able to index these documents in a search engine? If so, consider using a latent semantic analysis library such as lsa4solr.

    You can get the project from here: https://github.com/algoriffic/lsa4solr

    What you are trying to do is determine the meaning of a text corpus and classify it based on its meaning. However, words cannot be considered individually, in abstraction from the overall article.

    For example, suppose you have an article that talks a lot about “Windows”. The word is used 7 times in a 300-word article, so you know it is important. What you don’t know is whether it refers to the operating system “Windows” or the things you look through.

    Suppose you also see words such as “installation”. That doesn’t help either, because people install windows in their houses much as they install the Windows operating system. However, if the same article talks about defragmentation, operating systems, the command line and Windows 7, then you can guess that the document is actually about the Windows operating system.
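
The intuition in this example can be illustrated with something far simpler than Latent Semantic Indexing: counting how many context words of each candidate sense appear in the text. The sketch below is not LSI/LSA, only a hand-made stand-in for the idea that surrounding words decide the meaning; the sense vocabularies and the function name `guess_sense` are made up for illustration.

```python
import re

# Hypothetical context vocabularies for two senses of "windows".
SENSES = {
    "operating system": {"defragmentation", "command", "operating", "system", "driver", "software"},
    "house": {"glass", "frame", "curtain", "wall", "pane", "house"},
}


def guess_sense(text):
    """Return the sense whose context words overlap the text most, or None if none match."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    overlaps = {sense: len(tokens & vocab) for sense, vocab in SENSES.items()}
    best = max(overlaps, key=overlaps.get)
    return best if overlaps[best] > 0 else None


article = (
    "After the installation of Windows 7, run defragmentation "
    "from the command line of the operating system."
)
print(guess_sense(article))  # operating system
```

A word like "installation" appears in neither vocabulary, so on its own it leaves the sense undecided, which mirrors the point made above.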

    However, how can you determine this?

    This is where Latent Semantic Indexing comes in. What you want to do is extract the entire document’s text and then apply some clever analysis to it.

    The matrices that you build are way above my head. Although I have looked at some libraries and used them, I have never fully understood the complex math behind building the space-aware matrix used by Latent Semantic Analysis, so my advice is to use an existing library to do this for you.

    Happy to remove this answer if you aren’t looking for external libraries and want to do this yourself.

© 2021 The Archive Base. All Rights Reserved