Approximate string matching is not a stranger problem.
I am learning and trying to understand how to solve it. I even now don’t want to get too deep into it and just want to understand the brute-force way.
In its wiki page (Approximate string matching), it says
A brute-force approach would be to compute the edit distance to P (the pattern) for all substrings of T, and then choose the substring with the minimum distance. However, this algorithm would have the running time O(m * n^3), n is the length of T, m is the length of P
Ok. I understand this statement in the following way:
- We find out all possible substrings of T
- We compute the edit distance of each pair of strings {P, t1}, {P, t2}, …
- We find out which substring has the shortest distance from P and this substring is the answer.
I have the following question:
a. I can use two for-loop to get all possible substrings and this requires O(n^2). So when I try to compute the edit distance of one substring and the patter, does it need O(n*m)? Why?
b. How exactly do I compute the distance of one pair (one substring and the patter)? I know I can insert, delete, substitute, but can anyone give me a algorithm that do just the calculation for one pair?
Thanks
Edit
Ok, I should use Levenshtein distance, but I don’t quite understand its method.
Here is part of the code
for j from 1 to n
{
for i from 1 to m
{
if s[i] = t[j] then
d[i, j] := d[i-1, j-1] // no operation required
else
d[i, j] := minimum
(
d[i-1, j] + 1, // a deletion
d[i, j-1] + 1, // an insertion
d[i-1, j-1] + 1 // a substitution
)
}
}
So, assume I am now comparing {"suv", "svi"}.
So 'v' != 'i', then I have to see three other pairs:
{"su", "sv"}{"suv", "sv"}{"su", "svi"}
How can I understand this part? Why I need to see these 3 parts?
Does the distance between two prefixes mean that we need distance number of changes in order to make the two prefixes (or strings) equal?
So, let’s take a look at {"su", "sv"}. We can see that distance of {"su", "sv"} is 1. Then how can {"su", "sv"} become {"suv", "svi"} by just adding 1? I think we need to insert ‘v’ into “su” and ‘v’ into “sv” and then substitute the last ‘i’ with ‘v’, which has 3 operations involved, right?
The standard way of measuring the edit distance between two strings is called Levenshtein distance – the wikipedia page contains pseudocode for the algorithm.
As for your edit: You need to look at
{"su", "sv"}because it is possible that the best way to change"suv"into"svi"is to replace the lastvbyi, whose cost will come on top of the cost for changing"su"to"sv". Or, it could be that the best way is to change"suv"into"sv"somehow and then add ani. Or, it could be that the best way is to first delete thevfrom"suv"and then change"su"into"svi". The first way turns out to be best (or as good as the other options) in this case. The edit distance is indeed 2, and the operations are to change theuinto avand thevinto ani.