Algorithms for edit distance give a measure of the distance between two strings. Question:

Question

0

Asked: June 9, 20262026-06-09T14:08:25+00:00 2026-06-09T14:08:25+00:00

Algorithms for edit distance give a measure of the distance between two strings. Question:

0

Algorithms for edit distance give a measure of the distance between two strings.

Question: which of these measures would be most relevant to detect two different persons names which are actually the same? (different because of a mispelling). The trick is that it should minimize false positives. Example:

Obaama
Obama
=> should probably be merged

Obama
Ibama
=> should not be merged.

This is just an oversimple example. Are their programmers and computer scientists who worked out this issue in more detail?

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-09T14:08:26+00:00

I can suggest an information-retrieval technique of doing so, but it requires a large collection of documents in order to work properly.

Index your data, using the standard IR techniques. Lucene is a good open source library that can help you with it.

Once you get a name (Obaama for example): retrieve the set of collections the word Obaama appears in. Let this set be D1.

Now, for each word w in D1¹ search for Obaama AND w (using your IR system). Let the set be D2.

The score |D2|/|D1| is an estimation how much w is connected to Obaama, and most likely will be close to 1 for w=Obama².

You can manually label a set of examples and find the value from which words will be expected.

Using a standard lexicographical similarity technique you can chose to filter out words that are definetly not spelling mistakes (Like Barack).

Another solution that is often used requires a query log – find a correlation between searched words, if obaama has correlation with obama in the query log – they are connected.

1: You can improve performance by first doing the 2nd filter, and check only for candidates who are “similar enough” lexicographically.

2: Usually a normalization is also used, because more frequent words are more likely to be in the same documents with any word, regardless of being related or not.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

Algorithms for edit distance give a measure of the distance between two strings. Question:

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply