I want to use the Levenshtein algorithm for the following task: if a user on my website searches for some value (he enters characters in an input), I want to instantly check for suggestions with AJAX, like Google Instant does.
I have the impression that the Levenshtein algorithm is way too slow for such a task. To check its behaviour, I first implemented it in Java, printing out the two Strings in every recursive call of the method.
    public class Levenshtein {
        public static void main(String[] arg){
            String a = "Hallo Zusammen";
            String b = "jfdss Zusammen";
            int res = levenshtein(a, b);
            System.out.println(res);
        }
        public static int levenshtein(String s, String t){
            int len_s = s.length();
            int len_t = t.length();
            int cost = 0;
            System.out.println("s: " + s + ", t: " + t);
            if(len_s>0 && len_t>0){
                if(s.charAt(0) != t.charAt(0)) cost = 1;
            }
            if(len_s == 0){
                return len_t;
            }else{
                if(len_t == 0){
                    return len_s;
                }else{
                    String news = s.substring(0, s.length()-1);
                    String newt = t.substring(0, t.length()-1);
                    return min(levenshtein(news, t) + 1,
                               levenshtein(s, newt) + 1,
                               levenshtein(news, newt) + cost);
                }
            }
        }
        public static int min(int a, int b, int c) {
            return Math.min(Math.min(a, b), c);
        }
    }
However, here are some points:
- The check `if(len_s>0 && len_t>0)` was added by me, because I was getting a `StringIndexOutOfBoundsException` with the above test values.
- With the above test values, the algorithm seems to calculate infinitely.
Are there optimizations that can be made on the algorithm to make it work for me, or should I use a completely different one to accomplish the desired task?
1) A few words about improving the Levenshtein distance algorithm
A recursive implementation of the Levenshtein distance has exponential complexity.
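To get a feeling for why the recursive version seems to run forever, you can count its calls instead of making them. The sketch below is only my illustration: `RecursiveCallCount` and `calls` are made-up names, and it assumes the recursion has the same shape as the code in the question (one call per pair of strings, plus three recursive calls when both are non-empty).

```java
import java.math.BigInteger;

public class RecursiveCallCount {

    // Number of invocations the naive recursion makes for inputs of
    // length m and n: the call itself, plus its three recursive calls
    // whenever both strings are non-empty.
    public static BigInteger calls(int m, int n) {
        BigInteger[][] c = new BigInteger[m + 1][n + 1];
        for (int i = 0; i <= m; i++) {
            for (int j = 0; j <= n; j++) {
                if (i == 0 || j == 0) {
                    // Base case: one of the strings is empty, the call returns at once.
                    c[i][j] = BigInteger.ONE;
                } else {
                    c[i][j] = BigInteger.ONE
                            .add(c[i - 1][j])
                            .add(c[i][j - 1])
                            .add(c[i - 1][j - 1]);
                }
            }
        }
        return c[m][n];
    }

    public static void main(String[] args) {
        // Print the call count for a few square input sizes, up to the
        // 14-character strings used in the question.
        for (int n = 2; n <= 14; n += 4) {
            System.out.println(n + " x " + n + ": " + calls(n, n) + " calls");
        }
    }
}
```

For two 14-character strings the count is on the order of 10^10, and in the question every one of those calls also prints a line and builds new substrings, so the program is not stuck, just very slow.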
I'd suggest using memoization and implementing the Levenshtein distance without recursion, which reduces the complexity to O(N^2) (it needs O(N^2) memory).
Or, even better: notice that for each cell in the distance matrix you only need information from the previous row, so you can reduce the memory needs to O(N).
2) A few words about autocomplete
Levenshtein distance is a good fit only if the query is supposed to match a whole keyword (nearly) exactly.
But what if your keyword is `apple` and the user typed `green apples`? The Levenshtein distance between the query and the keyword would be large (7 points). And the Levenshtein distance between `apple` and `bcdfghk` (a dumb string) would be 7 points too!
I'd suggest using a full-text search engine (e.g. Lucene). The trick is that you have to use an n-gram model to represent each keyword.
In a few words:
1) You have to represent each keyword as a document that contains its n-grams: `apple -> [ap, pp, pl, le]`.
2) After transforming each keyword into a set of n-grams, you have to index each keyword-document by n-gram in your search engine. The resulting index maps each n-gram to the keywords that contain it.
3) So now you have an n-gram index. When you get a query, you have to split it into n-grams, which gives you the set of the user's query n-grams. Then all you need is to retrieve the most similar documents from your search engine. As a first draft, that would be enough.
4) For better suggestions, you may rank the search engine's results by Levenshtein distance.
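Steps 1) to 4) can be tried out in memory before bringing in a real search engine. What follows is a minimal sketch under my own assumptions, not Lucene's API: the class `NGramSuggest` and its methods `bigrams`, `add` and `suggest` are hypothetical names, a plain `HashMap` stands in for the search engine index, n is fixed to 2, and a candidate is any keyword that shares at least one bigram with the query. The `levenshtein` method is the non-recursive variant from point 1) that keeps only two rows of the matrix.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class NGramSuggest {

    // bigram -> keywords containing it (toy stand-in for a search engine index)
    private final Map<String, Set<String>> index = new HashMap<>();

    // Step 1: split a string into character bigrams, e.g. apple -> [ap, pp, pl, le].
    public static Set<String> bigrams(String s) {
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }

    // Step 2: index the keyword under each of its bigrams.
    public void add(String keyword) {
        for (String gram : bigrams(keyword)) {
            index.computeIfAbsent(gram, g -> new HashSet<>()).add(keyword);
        }
    }

    // Steps 3 and 4: collect every keyword sharing a bigram with the query,
    // then order the candidates by Levenshtein distance (ties alphabetically).
    public List<String> suggest(String query) {
        Set<String> candidates = new HashSet<>();
        for (String gram : bigrams(query)) {
            candidates.addAll(index.getOrDefault(gram, Collections.emptySet()));
        }
        List<String> ranked = new ArrayList<>(candidates);
        ranked.sort(Comparator.comparingInt((String k) -> levenshtein(query, k))
                .thenComparing(Comparator.naturalOrder()));
        return ranked;
    }

    // Iterative Levenshtein distance: O(N^2) time, two rows of memory.
    public static int levenshtein(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j; // turning the empty prefix of s into t[0..j) takes j inserts
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i; // turning s[0..i) into the empty string takes i deletes
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two arrays instead of allocating
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    public static void main(String[] args) {
        NGramSuggest suggester = new NGramSuggest();
        for (String k : new String[] {"apple", "pineapple", "banana", "grape"}) {
            suggester.add(k);
        }
        System.out.println(suggester.suggest("green apples"));
    }
}
```

The point of the design is that `apple` is still retrieved for `green apples` through the shared bigrams, even though its Levenshtein distance to the query is 7, while `banana` shares no bigram and never reaches the ranking step. A real engine would also score candidates by how many n-grams they share, which this sketch leaves out.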
P.S. I'd suggest you look through the book "Introduction to Information Retrieval".