- Let
TARGETbe a set of strings that I expect to be spoken. - Let
SOURCEbe the set of strings returned by a speech recognizer (that is, the possible sentences that it has heard).
I need a way to choose a string from TARGET. I read about the Levenshtein distance and the Damerau-Levenshtein distance, which basically returns the distance between a source string and a target string, that is the number of changes needed to transform the source string into the target string.
But, how can I apply this algorithm to a set of target strings?
I thought I’d use the following method:
- For each string that belongs to
TARGET, I calculate the distance from each string inSOURCE. In this way we obtain an m-by-n matrix, where n is the cardinality ofSOURCEand n is the cardinality ofTARGET. We could say that the i-th row represents the similarity of the sentences detected by the speech recognizer with respect to the i-th target. - Calculating the average of the values on each row, you can obtain the average distance between the i-th target and the output of the speech recognizer. Let’s call it
average_on_row(i), whereiis the row index. - Finally, for each row, I calculate the standard deviation of all values in the row. For each row, I also perform the sum of all the standard deviations. The result is a column vector, in which each element (Let’s call it
stadard_deviation_sum(i)) refers to a string ofTARGET.
The string which is associated with the shortest stadard_deviation_sum could be the sentence pronounced by the user. Could be considered the correct method I used? Or are there other methods?
Obviously, too high values indicate that the sentence pronounced by the user probably does not belong to TARGET.
You need to calculate these probabilities first: probability of insertion, deletion and substitution. Then use log of these probabilities as penalties for each operation.
In a “context independent” situation, if pi is probability of insertion, pd is probability of deletion and ps probability of substitution, the probability of observing the same symbol is pp=1-ps-pd.
In this case use log(pi/pp/k), log(pd/pp) and log(ps/pp/(k-1)) as penalties for insertion, deletion and substitution respectively, where k is the number of symbols in the system.
Essentially if you use this distance measure between a source and target you get log probability of observing that target given the source. If you have a bunch of training data (i.e. source-target pairs) choose some initial estimates for these probabilities, align source-target pairs and re-estimate these probabilities (AKA EM strategy).
You can start with one set of probabilities and assume context independence. Later you can assume some kind of clustering among the contexts (eg. assume there are k different sets of letters whose substitution rate is different…).