I am planning an NLP project in any programming language (though Python is my preference).
I want to take two documents and determine how similar they are.
The common way of doing this is to transform the documents into TF-IDF vectors and then compute the cosine similarity between them. Any textbook on information retrieval (IR) covers this. See especially Introduction to Information Retrieval, which is free and available online.
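To make the idea concrete, here is a minimal from-scratch sketch in plain Python. The whitespace tokenizer, the smoothed idf formula, and the three toy documents are assumptions chosen for illustration; they are not taken from any particular textbook or library.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, weighted by tf * idf."""
    # Naive tokenization (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf, so terms found in every document keep a nonzero weight.
        vectors.append(
            {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
        )
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
]
vecs = tfidf_vectors(docs)
sim_01 = cosine_similarity(vecs[0], vecs[1])  # documents sharing several terms
sim_02 = cosine_similarity(vecs[0], vecs[2])  # documents sharing no terms
```

Documents with no terms in common score exactly 0, and a document compared with itself scores 1. For real work, a library implementation with proper tokenization and stop-word handling is the better choice.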
Computing Pairwise Similarities
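TF-IDF (and similar text transformations) are implemented in the Python packages Gensim and scikit-learn. In the latter package, computing cosine similarities takes only a few lines: fit a TfidfVectorizer on the documents, whether they are read from files or supplied as plain strings, and multiply the resulting TF-IDF matrix by its transpose. No extra normalization is needed, because the vectorizer returns L2-normalized rows by default.
Gensim may have more options for this kind of task.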
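A minimal sketch of the scikit-learn approach follows. The corpus is an illustrative assumption (only its final document is quoted in this answer), and removing English stop words is one possible setting, not a requirement.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative corpus (an assumption); any list of plain strings works.
corpus = [
    "I'd like an apple",
    "An apple a day keeps the doctor away",
    "Never compare an apple to an orange",
    "I prefer scikit-learn to Orange",
    "The scikit-learn docs are Orange and Blue",
]

# TfidfVectorizer L2-normalizes each row by default (norm="l2"),
# so the dot product of two rows is already their cosine similarity.
vect = TfidfVectorizer(stop_words="english")
tfidf = vect.fit_transform(corpus)

# Sparse (n_docs x n_docs) matrix of pairwise cosine similarities.
pairwise_similarity = tfidf @ tfidf.T
```

If the documents live in files, one option is to read their contents into a list of strings first; another is to pass the file paths and construct the vectorizer with input="filename".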
See also this question.
[Disclaimer: I was involved in the scikit-learn TF-IDF implementation.]
Interpreting the Results
From above, pairwise_similarity is a SciPy sparse matrix that is square in shape, with the number of rows and columns equal to the number of documents in the corpus.
You can convert the sparse matrix to a NumPy array via .toarray() or .A.
Let’s say we want to find the document most similar to the final document, "The scikit-learn docs are Orange and Blue". This document has index 4 in corpus. You can find the index of the most similar document by taking the argmax of that row, but first you’ll need to mask the 1’s, which represent the similarity of each document to itself. You can do the latter through np.fill_diagonal(), and the former through np.nanargmax().
Note: the purpose of using a sparse matrix is to save a substantial amount of space for a large corpus and vocabulary. Instead of converting the whole matrix to a NumPy array, you could mask the diagonal and take the argmax on the sparse matrix itself.
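Both routes can be sketched as follows. The corpus is again an illustrative assumption apart from its final document, and the names input_idx, dense_result, and sparse_result are hypothetical. The sparse route shown here is one possible way to avoid densifying the full matrix, not necessarily the only one.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative corpus (an assumption); the query document sits at index 4.
corpus = [
    "I'd like an apple",
    "An apple a day keeps the doctor away",
    "Never compare an apple to an orange",
    "I prefer scikit-learn to Orange",
    "The scikit-learn docs are Orange and Blue",
]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)
pairwise_similarity = tfidf @ tfidf.T
input_idx = 4

# Dense route: convert, mask self-similarity with NaN, then take the
# NaN-aware argmax of the query row.
arr = pairwise_similarity.toarray()
np.fill_diagonal(arr, np.nan)
dense_result = int(np.nanargmax(arr[input_idx]))

# Sparse route: overwrite the diagonal with -1.0 (below any cosine
# similarity of non-negative TF-IDF vectors) and densify only one row.
sparse_sim = pairwise_similarity.tolil()
sparse_sim.setdiag(-1.0)
row = sparse_sim.tocsr()[input_idx].toarray().ravel()
sparse_result = int(row.argmax())

print(corpus[dense_result])
```

With this corpus, both routes pick index 3, "I prefer scikit-learn to Orange", which shares the most weighted terms with the query document. The sparse route uses -1.0 instead of NaN because it only needs a value lower than every real similarity score.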