I have ~25.000 distinct names in an SQL database, and would like to perform

Question

Asked: May 28, 2026

I have ~25,000 distinct names in an SQL database, and would like to run an edit-distance comparison across all of them in order to normalize variants such as John Doe and Jhon Doe.

When the database held only around 1,000 names, I stored all distinct names in an array and used two nested for-loops to compare each element with every other. When the edit-distance similarity was above, say, 0.9, I would execute an SQL query substituting one value for the other in all records.

With my much larger database this is no longer feasible. What would you do?

ps: I’m also curious about any multithreaded solutions to this because the process is taking ages now.

pps: I’m coding in Java

1 Answer

Editorial Team · Answer 1 · 2026-05-28T20:24:48+00:00

What about computing the Soundex code of each of your names and possibly storing it in the database? You can even do that on the database side; MySQL, for instance, has a SOUNDEX function.

After computing the Soundex code of each name, all you have to do is group the rows by identical code.
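If you would rather do the grouping in Java than in the database, something like the following might work. It is only a sketch: `SoundexGrouping`, `soundex` and `groupBySoundex` are names I made up, the encoder is a hand-written version of the classic four-character American Soundex (so its output may differ from what MySQL's SOUNDEX returns), and it treats the whole string as one word, skipping anything that is not A-Z.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class SoundexGrouping {
    // Soundex digit for each letter A..Z; '0' marks letters that get no digit
    // (vowels, H, W, Y).
    private static final String CODES = "01230120022455012623010202";

    // Hypothetical helper: classic 4-character American Soundex.
    static String soundex(String name) {
        StringBuilder out = new StringBuilder();
        char prev = '0';
        for (char c : name.toUpperCase(Locale.ROOT).toCharArray()) {
            if (c < 'A' || c > 'Z') continue;          // skip spaces, punctuation, non-ASCII
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c);                          // first letter is kept as-is
                prev = code;
            } else {
                if (code != '0' && code != prev) out.append(code);
                if (c != 'H' && c != 'W') prev = code;  // H and W do not separate equal codes
            }
            if (out.length() == 4) break;
        }
        if (out.length() == 0) return "";               // no letters at all
        while (out.length() < 4) out.append('0');       // pad short codes with zeros
        return out.toString();
    }

    // Buckets names by Soundex code; each bucket holds likely spelling variants.
    static Map<String, List<String>> groupBySoundex(Collection<String> names) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String n : names) {
            groups.computeIfAbsent(soundex(n), k -> new ArrayList<>()).add(n);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> names = List.of("John Doe", "Jhon Doe", "Mary Poppins");
        groupBySoundex(names).forEach((code, group) -> System.out.println(code + " -> " + group));
    }
}
```

Here "John Doe" and "Jhon Doe" end up in the same group, while "Mary Poppins" gets a group of its own. Because only four characters are kept, the code mostly reflects the first word of a full name; encoding each word separately would be a reasonable variation.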

EDIT:

If Soundex is too coarse for your application, you can first select candidates by comparing their Soundex codes, and then apply your usual metric within each set of candidates.
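A rough Java sketch of that two-stage idea, under some assumptions: `SoundexBlocking` and its methods are hypothetical names, the Soundex encoder is a hand-written classic American variant, and since I don't know your exact metric I used Levenshtein distance normalized as 1 - distance / (length of the longer string). Swap in your own metric and threshold.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class SoundexBlocking {
    private static final String CODES = "01230120022455012623010202"; // digits for A..Z

    // Hypothetical helper: classic 4-character American Soundex over letters A-Z.
    static String soundex(String name) {
        StringBuilder out = new StringBuilder();
        char prev = '0';
        for (char c : name.toUpperCase(Locale.ROOT).toCharArray()) {
            if (c < 'A' || c > 'Z') continue;
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) { out.append(c); prev = code; continue; }
            if (code != '0' && code != prev) out.append(code);
            if (c != 'H' && c != 'W') prev = code;
            if (out.length() == 4) break;
        }
        if (out.length() == 0) return "";
        while (out.length() < 4) out.append('0');
        return out.toString();
    }

    // Plain Levenshtein distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalization: 1.0 means identical, 0.0 means nothing in common.
    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        return maxLen == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / maxLen;
    }

    // Stage 1: block by Soundex. Stage 2: compare pairs only inside each block.
    static List<String[]> similarPairs(List<String> names, double threshold) {
        Map<String, List<String>> blocks = new HashMap<>();
        for (String n : names) {
            blocks.computeIfAbsent(soundex(n), k -> new ArrayList<>()).add(n);
        }
        List<String[]> pairs = new ArrayList<>();
        for (List<String> block : blocks.values()) {
            for (int i = 0; i < block.size(); i++) {
                for (int j = i + 1; j < block.size(); j++) {
                    if (similarity(block.get(i), block.get(j)) >= threshold) {
                        pairs.add(new String[] {block.get(i), block.get(j)});
                    }
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> names = List.of("John Doe", "Jhon Doe", "Jane Roe");
        for (String[] p : similarPairs(names, 0.7)) {
            System.out.println(p[0] + " ~ " + p[1]);
        }
    }
}
```

Note that with this particular normalization a swapped letter pair such as John/Jhon counts as two edits, so "John Doe" vs "Jhon Doe" scores 0.75 and would not pass a 0.9 threshold. Since the blocks do not depend on each other, they could probably also be handed to separate threads if the second stage is still too slow.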
