I need to implement a string-matching algorithm to determine which strings match most closely. I see that the Hamming distance is a good matching algorithm when the strings have the same fixed length.
Is there any advantage in the quality of matching if I were to use the Levenshtein distance instead? I know this method is less efficient, given that it accounts for variable-length strings, but what I’m really concerned with here is the quality of the matches. Also, are there any better algorithms out there that I may want to consider? I work in Java, if that makes any difference.
http://en.wikipedia.org/wiki/Levenshtein_distance
http://en.wikipedia.org/wiki/Hamming_distance
Many thanks
Consider the strings: “abcdefg” and “bcdefgh”.
The Levenshtein distance is 2. The Hamming distance (operating on characters rather than bits) is 7.
So it really depends on whether you want to treat those strings as similar or not. Hamming distance has its appropriate uses, but answering “will these strings look similar to a human being?” is not one of them.
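Since you work in Java, here is a minimal sketch that reproduces the two numbers above. The class and method names (`StringDistances`, `hamming`, `levenshtein`) are made up for illustration, and the Levenshtein part is the plain textbook dynamic-programming version with unit costs for insert, delete and substitute, not a tuned implementation.

```java
final class StringDistances {

    // Hamming distance over characters: counts positions where the strings differ.
    // Only defined for strings of equal length, so anything else is rejected.
    static int hamming(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Hamming distance needs equal-length strings");
        }
        int distance = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                distance++;
            }
        }
        return distance;
    }

    // Levenshtein distance: minimum number of single-character insertions,
    // deletions and substitutions needed to turn a into b.
    // Uses two rows of the DP table instead of the full matrix.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning the empty prefix of a into the first j chars of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            // Turning the first i chars of a into the empty string takes i deletions.
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitutionCost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + substitutionCost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            // The row just filled becomes the previous row for the next iteration.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(hamming("abcdefg", "bcdefgh"));     // prints 7
        System.out.println(levenshtein("abcdefg", "bcdefgh")); // prints 2
    }
}
```

The shifted string is the worst case for Hamming distance because every position mismatches, while Levenshtein only pays for dropping the leading “a” and appending the trailing “h”.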