Lucene uses a number of features to score documents, but scoring basically relies on the similarity between a document and your query. I explained the idea of calculating similarity between documents earlier in more or less simple words, so let me explain it here only briefly.
If you have a dictionary of all words, you can arrange them into one very long list. Mathematicians use the term “vector” for any sequence, including a list of words, so let’s call it a vector of words.
We can also express each document in our collection as a vector, where each element is the number of occurrences of the corresponding word in that document. For example, if a document has 1 occurrence of the word “bananas”, 2 occurrences of “about” and no occurrences of “abbat”, then the document vector holds 1 in the position for “bananas”, 2 in the position for “about” and 0 in the position for “abbat”.
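To make the idea concrete, here is a minimal sketch in Python. It is an illustration of the word-count vector described above, not Lucene's own code: the helper names `build_vocabulary` and `to_vector` are made up for this example, and splitting text on whitespace is a simplifying assumption (a real analyzer does much more).

```python
from collections import Counter

def build_vocabulary(documents):
    """Return a sorted list of every distinct word in the collection."""
    words = set()
    for text in documents:
        words.update(text.lower().split())
    return sorted(words)

def to_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for words that never occur in the text.
    return [counts[word] for word in vocabulary]

# A toy three-word dictionary (assumption: a real one has far more words).
vocabulary = ["abbat", "about", "bananas"]
document = "about bananas about"
print(to_vector(document, vocabulary))  # [0, 2, 1]
```

The position of each number matches the position of the word in the dictionary, so every document in the collection ends up as a vector of the same length.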
Now comes the most interesting part. We can assume that if two documents have a lot of words in common, they are about similar topics, and if they have very few words in common, they are very different. Since we already know that documents can be represented as vectors of words, we can calculate the similarity of documents as the similarity of their vectors.
There are many ways to calculate how similar two vectors are. Lucene uses quite a simple one: cosine similarity (often loosely called cosine distance). The idea comes from the geometric representation of vectors and the angle between them: if you draw two vectors in 2D space, you will see that the more similar their coordinates are, the smaller the angle between them, and the closer the cosine of that angle is to 1. This is where the measure comes from, but in practice what mostly matters is how many words the two documents share.
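As a rough illustration, the cosine of the angle between two count vectors can be computed as their dot product divided by the product of their lengths. The sketch below uses plain Python lists; the function name `cosine_similarity` and the sample vectors are assumptions made for this example, and Lucene's real scoring adds further factors on top of this idea.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # An empty document has no direction; treat it as not similar.
        return 0.0
    return dot / (norm_a * norm_b)

doc_a = [0, 2, 1]
doc_b = [0, 4, 2]  # same words in the same proportions as doc_a
doc_c = [3, 0, 0]  # shares no words with doc_a

print(cosine_similarity(doc_a, doc_b))  # approximately 1.0
print(cosine_similarity(doc_a, doc_c))  # 0.0
```

Documents that share no words always score 0, while documents that use the same words in the same proportions score close to 1 regardless of their length.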
When talking about search engines, queries are treated just like documents: a vector is built for the query and is then used to find the most similar (i.e. most relevant) documents in the collection.
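Putting the pieces together, a toy search could look like the sketch below. It only demonstrates the "query as a document" idea with raw word counts. The function `search` and the sample documents are hypothetical, and this is not how Lucene is implemented internally (it uses an inverted index and a more elaborate scoring formula).

```python
import math
from collections import Counter

def to_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(query, documents):
    """Return (score, document) pairs, most similar to the query first."""
    # The dictionary is built from the collection, so query words that
    # appear in no document are simply ignored.
    vocabulary = sorted({w for text in documents for w in text.lower().split()})
    query_vector = to_vector(query, vocabulary)
    scored = [
        (cosine_similarity(query_vector, to_vector(text, vocabulary)), text)
        for text in documents
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

documents = [
    "bananas are yellow",
    "an article about bananas and about apples",
    "lucene scores documents",
]

results = search("about bananas", documents)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

With these sample documents, the one that mentions both “about” and “bananas” is ranked first, the one that mentions only “bananas” comes second, and the one that shares no words with the query gets a score of 0.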