I’m trying to create an algorithm that assigns a relevance score to a webpage based on the keywords it finds on the page.
I’m doing this at the moment:
I define some words and a value for each: “movie” (10), “cinema” (6), “actor” (5) and “hollywood” (4). I then search certain parts of the page, give each part a weight, and multiply it by the weight of each word found there.
Example: the word “movie” was found in the URL (1.5 * 10) and in the title (2.5 * 10), so the score is 15 + 25 = 40.
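To make the scheme concrete, here is a minimal Python sketch of it. The word weights and the URL/title weights are the ones given above; the function name `raw_score`, the substring matching, and the sample URL and title are assumptions made only for illustration.

```python
# Word weights and page-part weights taken from the description above.
WORD_WEIGHTS = {"movie": 10, "cinema": 6, "actor": 5, "hollywood": 4}
PART_WEIGHTS = {"url": 1.5, "title": 2.5}


def raw_score(page_parts):
    """Sum part_weight * word_weight for every keyword found in every part.

    page_parts maps a part name ("url", "title") to its text.
    Matching is a plain case-insensitive substring test (an assumption).
    """
    score = 0.0
    for part, part_weight in PART_WEIGHTS.items():
        text = page_parts.get(part, "").lower()
        for word, word_weight in WORD_WEIGHTS.items():
            if word in text:
                score += part_weight * word_weight
    return score


# Hypothetical page: "movie" appears in both the URL and the title.
page = {"url": "example.com/movie/123", "title": "Movie plot summary"}
print(raw_score(page))  # 1.5 * 10 + 2.5 * 10 = 40.0
```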
This is trash! It’s my first attempt, and it returns some relevant results, but I don’t think a relevance expressed as a value like 244, 66, 30 or 15 is useful.
I want something that stays within a range, from 0 to 1 or 1 to 100.
What type of weighting for words can I use?
Besides that, are there existing algorithms that score the relevance of an HTML page based on things like the URL, keywords, title, etc., but not the main content?
EDIT 1: All of this can be rebuilt. The current weights are arbitrary; I want to use a consistent weighting scheme, not random numbers like 10, 5 and 3.
Something like: low importance = 1, medium importance = 2, high importance = 4, deterministic importance = 8.
Title > Link Part of URL > Domain > Keywords
movie > cinema > actor > hollywood
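As a sketch of what that could look like in Python: the four importance levels and the two orderings are the ones listed above, but mapping them one-to-one onto the parts and words, dividing by the maximum possible score to land in the 0 to 1 range, and every name in the code are assumptions, not a known standard algorithm.

```python
# Importance levels from EDIT 1.
IMPORTANCE = {"low": 1, "medium": 2, "high": 4, "deterministic": 8}

# Assumed one-to-one mapping of the orderings above onto those levels:
# Title > Link Part of URL > Domain > Keywords
PART_WEIGHTS = {
    "title": IMPORTANCE["deterministic"],
    "url_path": IMPORTANCE["high"],
    "domain": IMPORTANCE["medium"],
    "keywords": IMPORTANCE["low"],
}
# movie > cinema > actor > hollywood
WORD_WEIGHTS = {
    "movie": IMPORTANCE["deterministic"],
    "cinema": IMPORTANCE["high"],
    "actor": IMPORTANCE["medium"],
    "hollywood": IMPORTANCE["low"],
}


def normalized_score(page_parts):
    """Return a score in [0, 1]: the raw score divided by the best possible one."""
    # Best case: every word appears in every part.
    max_score = sum(PART_WEIGHTS.values()) * sum(WORD_WEIGHTS.values())
    score = 0
    for part, part_weight in PART_WEIGHTS.items():
        text = page_parts.get(part, "").lower()
        for word, word_weight in WORD_WEIGHTS.items():
            if word in text:  # case-insensitive substring match (an assumption)
                score += part_weight * word_weight
    return score / max_score


# Hypothetical page with two keywords in the title only: 8 * (8 + 2) / 225
print(normalized_score({"title": "The Movie Actor"}))
```

Multiplying the result by 100 gives a 0 to 100 scale instead.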
EDIT 2: At the moment, I want to analyze the page’s relevance for the words while excluding the body content of the page. The analysis will include the domain, the link part of the URL, the title, the keywords (and any other meta information I judge useful).
The reason is that the HTML content is dirty. I can find many words like ‘movie’ in menus and advertisements even when the main content of the page contains nothing relevant to the theme.
Another reason is that some pages have meta information indicating that the page is about a movie while the main content does not. Example: a page containing the plot of a film, telling the story, the characters, etc., where nothing in that text indicates that it is about a movie; only the page’s meta information does.
Later, after running a relevance analysis on the HTML page, I will run a separate relevance analysis on the (filtered) content.
Are you able to index these documents in a search engine? If you are, then maybe you should consider using this latent semantic analysis library.
You can get the actual project from here: https://github.com/algoriffic/lsa4solr
What you are trying to do is determine the meaning of a text corpus and classify it based on its meaning. However, words are not unambiguous on their own and cannot be considered in isolation from the rest of the article.
For example, suppose that you have an article which talks a lot about “Windows”. The word is used 7 times in a 300-word article, so you know that it is important. What you don’t know is whether it is talking about the operating system “Windows” or the things that you look through.
Suppose then that you also see words such as “Installation”. That doesn’t help you either, because people install windows in their houses much like they install the Windows operating system. However, if the very same article talks about defragmentation, operating systems, the command line and Windows 7, then you can guess that the document is actually about the Windows operating system.
However, how can you determine this?
This is where Latent Semantic Indexing comes in. What you want to do is extract the entire document’s text and then apply some clever analysis to it.
The matrices that you build (see here) are way above my head, and although I have looked at some libraries and used them, I have never been able to fully understand the complex math behind building the space-aware matrix that is used by Latent Semantic Analysis. So my advice is to just use an already existing library to do this for you.
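For a rough feel of what such a library computes, here is a toy, dependency-free Python sketch, not how lsa4solr or any particular library is implemented. It builds a term-document count matrix, keeps the top 2 singular directions (found by power iteration on the document Gram matrix, which stands in for a proper SVD routine), and compares documents by cosine similarity in that reduced space. The four sample documents and all function names are made up for illustration.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Rows are terms, columns are documents, entries are raw term counts."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    return [[c[term] for c in counts] for term in vocab]


def top_eigenpairs(matrix, k, iterations=500):
    """Top-k eigenpairs of a symmetric matrix via power iteration with deflation."""
    n = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        vec = [1.0 / (i + 1) for i in range(n)]  # fixed start keeps it deterministic
        for _ in range(iterations):
            nxt = [sum(work[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in nxt))
            vec = [x / norm for x in nxt]
        value = sum(vec[i] * work[i][j] * vec[j] for i in range(n) for j in range(n))
        pairs.append((value, vec))
        # Deflate: remove the component just found so the next one can surface.
        for i in range(n):
            for j in range(n):
                work[i][j] -= value * vec[i] * vec[j]
    return pairs


def lsa_document_vectors(docs, k=2):
    """Coordinates of each document in a k-dimensional latent space."""
    a = term_document_matrix(docs)
    n = len(docs)
    # Gram matrix A^T A; its eigenvectors are the right singular vectors of A.
    gram = [[sum(row[i] * row[j] for row in a) for j in range(n)] for i in range(n)]
    pairs = top_eigenpairs(gram, k)
    return [[math.sqrt(max(value, 0.0)) * vec[d] for value, vec in pairs]
            for d in range(n)]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


# Made-up documents: two about the operating system, two about house windows.
docs = [
    "windows install defragmentation command line operating system",
    "windows operating system command line",
    "windows install glass frame house",
    "windows glass house frame curtains",
]
vectors = lsa_document_vectors(docs, k=2)
print("OS doc vs OS doc:   ", cosine(vectors[0], vectors[1]))
print("OS doc vs house doc:", cosine(vectors[0], vectors[2]))
```

Even though all four documents contain “windows”, and two of them contain “install”, the two operating-system documents should end up closer to each other than to the house documents. In practice an existing library with a real SVD implementation is the saner choice.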
Happy to remove this answer if you aren’t looking for external libraries and want to do this yourself.