Is there any dictionary-based string matching algorithm in Java?
Something that will give the percentage of similarity between two strings based on a dictionary?
For example, a method like
public double getSimilarity(String str1, String str2);
for which a call such as
getSimilarity("Professor", "Teacher")
would give a very high percentage?
Thanks in advance 🙂
There is great work by Shaul Markovitch and Evgeniy Gabrilovich, described in their article Wikipedia-based Semantic Interpretation for Natural Language Processing.
The idea is as follows: index Wikipedia (or another context source), creating a mapping for each term (word):
term -> articles in which the term appears
Each term is basically represented by a vector. For simplicity, let’s say it is a binary vector, so for the term t, the entry d will be ‘1’ if and only if the term t appears in the document d.
Now, given these vectors, to find out whether two terms t1, t2 are similar, all you have to do is take the vector similarity (for example, cosine similarity) of the two vectors that represent t1 and t2.
Note: the binary vector is a simplification; in fact, the article uses the tf-idf score that the term t has in a document d.
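To make the binary-vector version concrete, here is a minimal sketch in Java. It is not the article's implementation: the class name TermSimilarity, the four-sentence toy corpus, and the naive tokenization are all assumptions made for illustration, and it uses plain 0/1 entries rather than tf-idf weights. Each term's vector is stored as the set of document ids in which the term appears, and two terms are compared with cosine similarity.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy illustration only: binary term vectors over a tiny in-memory corpus (hypothetical class). */
final class TermSimilarity {

    // term -> ids of the documents containing it, i.e. the non-zero entries of its binary vector
    private final Map<String, Set<Integer>> index = new HashMap<>();

    TermSimilarity(List<String> documents) {
        for (int d = 0; d < documents.size(); d++) {
            // Naive tokenization (an assumption): lower-case, split on anything that is not a-z.
            for (String token : documents.get(d).toLowerCase(Locale.ROOT).split("[^a-z]+")) {
                if (!token.isEmpty()) {
                    index.computeIfAbsent(token, k -> new HashSet<>()).add(d);
                }
            }
        }
    }

    /** Cosine similarity of two binary vectors: (shared documents) / sqrt(|A| * |B|). */
    double getSimilarity(String str1, String str2) {
        Set<Integer> a = index.getOrDefault(str1.toLowerCase(Locale.ROOT), Collections.emptySet());
        Set<Integer> b = index.getOrDefault(str2.toLowerCase(Locale.ROOT), Collections.emptySet());
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0; // a term that never occurs gives no evidence of similarity
        }
        Set<Integer> shared = new HashSet<>(a);
        shared.retainAll(b); // documents in which both terms appear
        return shared.size() / Math.sqrt((double) a.size() * b.size());
    }

    public static void main(String[] args) {
        TermSimilarity sim = new TermSimilarity(Arrays.asList(
            "The professor gave a lecture at the university",
            "A teacher and a professor both teach students",
            "The teacher prepared a lecture for the class",
            "Bananas are rich in potassium"));
        System.out.println(sim.getSimilarity("Professor", "Teacher")); // share 1 of 2 documents each
        System.out.println(sim.getSimilarity("Professor", "Bananas")); // no shared documents
    }
}
```

With only four documents the scores mean very little; the approach relies on a large corpus such as Wikipedia. Moving towards what the article describes would mean replacing each set of document ids with a map from document id to tf-idf weight and taking the cosine of those weighted vectors.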