Let TARGET be a set of strings that I expect to be spoken. Let

Question

0

Asked: June 3, 20262026-06-03T08:22:29+00:00 2026-06-03T08:22:29+00:00

Let TARGET be a set of strings that I expect to be spoken. Let

0

Let TARGET be a set of strings that I expect to be spoken.
Let SOURCE be the set of strings returned by a speech recognizer (that is, the possible sentences that it has heard).

I need a way to choose a string from TARGET. I read about the Levenshtein distance and the Damerau-Levenshtein distance, which basically returns the distance between a source string and a target string, that is the number of changes needed to transform the source string into the target string.

But, how can I apply this algorithm to a set of target strings?

I thought I’d use the following method:

For each string that belongs to TARGET, I calculate the distance from each string in SOURCE. In this way we obtain an m-by-n matrix, where n is the cardinality of SOURCE and n is the cardinality of TARGET. We could say that the i-th row represents the similarity of the sentences detected by the speech recognizer with respect to the i-th target.
Calculating the average of the values on each row, you can obtain the average distance between the i-th target and the output of the speech recognizer. Let’s call it average_on_row(i), where i is the row index.
Finally, for each row, I calculate the standard deviation of all values in the row. For each row, I also perform the sum of all the standard deviations. The result is a column vector, in which each element (Let’s call it stadard_deviation_sum(i)) refers to a string of TARGET.

The string which is associated with the shortest stadard_deviation_sum could be the sentence pronounced by the user. Could be considered the correct method I used? Or are there other methods?
Obviously, too high values indicate that the sentence pronounced by the user probably does not belong to TARGET.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-03T08:22:31+00:00

You need to calculate these probabilities first: probability of insertion, deletion and substitution. Then use log of these probabilities as penalties for each operation.

In a “context independent” situation, if pi is probability of insertion, pd is probability of deletion and ps probability of substitution, the probability of observing the same symbol is pp=1-ps-pd.

In this case use log(pi/pp/k), log(pd/pp) and log(ps/pp/(k-1)) as penalties for insertion, deletion and substitution respectively, where k is the number of symbols in the system.

Essentially if you use this distance measure between a source and target you get log probability of observing that target given the source. If you have a bunch of training data (i.e. source-target pairs) choose some initial estimates for these probabilities, align source-target pairs and re-estimate these probabilities (AKA EM strategy).

You can start with one set of probabilities and assume context independence. Later you can assume some kind of clustering among the contexts (eg. assume there are k different sets of letters whose substitution rate is different…).

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

Let TARGET be a set of strings that I expect to be spoken. Let

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply