I have a project where I need to compare multi-chapter documents to a second document to determine their similarity. The issue is I have no idea how to go about doing this, what approaches exist, or if there are any libraries available.
My first question is… what is similar? The number of words that match, the number of consecutive words that match?
I could see writing a parser that puts each document into an array with the word and location and then comparing them.
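A minimal sketch of that idea, assuming a "word" is simply a run of letters or apostrophes and that "location" means the word's position in the text. The names `index_words` and `shared_words` are made up for illustration and do not come from any library:

```python
import re
from collections import defaultdict

def index_words(text):
    """Map each lowercased word to the list of positions where it occurs."""
    positions = defaultdict(list)
    # Assumption: a word is any run of letters or apostrophes.
    for i, word in enumerate(re.findall(r"[a-z']+", text.lower())):
        positions[word].append(i)
    return dict(positions)

def shared_words(index_a, index_b):
    """Return the words that appear in both indexes, sorted alphabetically."""
    return sorted(set(index_a) & set(index_b))
```

Keeping the positions around means a later step could check whether matching words also occur next to each other, which is one way to get at "consecutive words that match".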
I saw the earlier question at
Algorithms or libraries for textual analysis, specifically: dominant words, phrases across text, and collection of text
however, it seems somewhat different than what I am attempting to do.
Any options or pointers people may have would be great!
“What is similar?” We can’t tell you that; it is a fundamental requirement of your project. If you don’t know this, it’s a bit soon to think about how to do it.
It may be helpful to ask the question “why”. What will the measure of similarity be used for?
If, for example, the purpose is to detect plagiarism then detecting that two essays are similar because they talk about the same subjects and make similar references is not likely to be helpful – the entire class would submit similar essays! So there you might be looking for matching exact sentences and phrases.
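One rough way to look for matching phrases is to compare runs of consecutive words (word n-grams). The sketch below is only an illustration under assumptions: the function names are hypothetical, the run length of 5 words is an arbitrary default, and real plagiarism detection would need much more (handling of quotations, paraphrase, and so on).

```python
import re

def word_ngrams(text, n=5):
    """Return the set of all runs of n consecutive words in the text."""
    # Assumption: case and punctuation are irrelevant for matching.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_phrases(text_a, text_b, n=5):
    """Return the n-word runs that occur in both texts."""
    return word_ngrams(text_a, n) & word_ngrams(text_b, n)
```

A longer run length makes accidental matches less likely, at the cost of missing copied passages that were lightly edited.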
If instead you are trying to build a catalogue for some documents, then perhaps you would search out key words. Two documents are similar if they use the same vocabulary of words over a certain length, or similar proper nouns.
Those two examples are intended to demonstrate that until we understand what is meant by similar it is hard to give much advice.
However, here’s a possible approach. You could write two main things: an Extractor and a Comparator.
The extractor’s job is to munge through the document and produce the set (or list, does it need to be ordered?) of chunks that are the essence of the document: these might be individual words, or sentences and phrases.
The comparator’s job is to evaluate the similarity of two documents’ “essences”.
Simple example: extract the unique list of words of 8 letters or more from the document.
Comparison could then be: two documents are similar if one’s set contains more than 75% of the other’s.
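The simple example could be sketched like this. The function names are made up, and the code assumes that a "word" is a run of letters, that case does not matter, and that the 75% rule is satisfied if it holds in either direction:

```python
import re

def extract(text, min_length=8):
    """Extractor: the unique words of at least min_length letters."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if len(w) >= min_length}

def is_similar(essence_a, essence_b, threshold=0.75):
    """Comparator: True if either set contains more than `threshold`
    of the other set's words."""
    if not essence_a or not essence_b:
        return False  # assumption: an empty essence is similar to nothing
    common = len(essence_a & essence_b)
    return (common / len(essence_a) > threshold
            or common / len(essence_b) > threshold)
```

Usage would be along the lines of `is_similar(extract(doc_a), extract(doc_b))`. Because the extractor and comparator are separate, either one can be swapped out (phrases instead of words, a different threshold) once you have decided what "similar" means for your project.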