I’m using levenshtein algorithm to meet these requirements:
When finding a word of N characters, the words to suggest as correction in my dictionary database are:
Every dictionary word of N characters that has 1 character of difference with the found word. Example: found word:bearn, dictionary word: bears
Every dictionary word of N+1 characters that has N characters equal to the found word. Example: found word:bear, dictionary word: bears
Every dictionary word of N-1 characters that has N-1 characters equal to the found word. Example: found word: bears, dictionary word: bear
I’m using this implementation of Levenshtein algorithm in C++ to find when a word has a Levenshtein number of 1 (which is the Levenshtein number for all three cases), but then how do I choose the word to suggest? I read about Boyer-Moore-Horspool and Knuth-Morris-Pratt but I’m not sure on how either of them can be helpful.
#include <string> #include <vector> #include <algorithm> using namespace std; int levenshtein(const string &s1, const string &s2) { string::size_type N1 = s1.length(); string::size_type N2 = s2.length(); string::size_type i, j; vector<int> T(N2+1); for ( i = 0; i <= N2; i++ ) T[i] = i; for ( i = 0; i < N1; i++ ) { T[0] = i+1; int corner = i; for ( j = 0; j < N2; j++ ) { int upper = T[j+1]; if ( s1[i] == s2[j] ) T[j+1] = corner; else T[j+1] = min(T[j], min(upper, corner)) + 1; corner = upper; } } return T[N2]; }
You may also want to add Norvig’s excellent article on spelling correction to your reading.
It’s been a while since I’ve read it but I remember it being very similar to what your writing about.