I have ~25,000 distinct names in an SQL database and would like to run an edit-distance comparison across all of them in order to normalize variants such as John Doe and Jhon Doe.
When the DB held only around 1,000 names, I stored all distinct names in an array and used two nested for-loops to compare each element with every other one. When the edit-distance similarity of a pair was above, say, 0.9, I would execute an SQL query substituting one value for the other in all records.
With my much larger database this is no longer feasible, because the number of pairwise comparisons grows quadratically with the number of names. What would you do?
PS: I’m also curious about any multithreaded solutions, because the process is taking ages now.
PPS: I’m coding in Java.
What about computing the soundex of each of your names and possibly storing it in the database? You can even do that on the DB side; for instance, MySQL has a SOUNDEX function.
After computing the soundex of each name, all you have to do is group the rows by identical soundex.
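If you would rather do the grouping in Java, here is a minimal sketch. It assumes a simplified American Soundex (four characters, non-letters ignored, the whole name treated as one string), so its codes are not guaranteed to match what MySQL's SOUNDEX returns. The class and method names (SoundexGrouping, soundex, groupBySoundex) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class SoundexGrouping {
    // Digit for each letter A..Z; '0' marks letters that are never encoded.
    private static final String CODES = "01230120022455012623010202";

    // Simplified American Soundex (an assumption, not MySQL's exact variant).
    static String soundex(String name) {
        StringBuilder out = new StringBuilder();
        char prev = '0';
        for (char c : name.toUpperCase(Locale.ROOT).toCharArray()) {
            if (c < 'A' || c > 'Z') continue;        // ignore spaces, digits, punctuation
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c);                       // first letter is kept verbatim
                prev = code;
            } else if (c != 'H' && c != 'W') {       // H and W do not separate duplicates
                if (code != '0' && code != prev) out.append(code);
                prev = code;                         // vowels reset prev, so repeats count again
            }
            if (out.length() == 4) break;
        }
        while (out.length() > 0 && out.length() < 4) out.append('0'); // pad to 4 chars
        return out.toString();
    }

    // Buckets the names by their soundex code.
    static Map<String, List<String>> groupBySoundex(Collection<String> names) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String n : names) {
            groups.computeIfAbsent(soundex(n), k -> new ArrayList<>()).add(n);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> names = List.of("John Doe", "Jhon Doe", "Mary Roe");
        // "John Doe" and "Jhon Doe" end up in the same bucket (J530) under this sketch.
        System.out.println(groupBySoundex(names));
    }
}
```

This is a single pass over the names, so it avoids the all-pairs comparison entirely. Every bucket with more than one entry is a set of names that might need to be merged.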
EDIT:
If soundex is too coarse for your application, you can first select candidates by comparing their soundexes, and use your usual metric on each set of candidates.
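A rough sketch of that two-stage idea in Java, assuming the names have already been bucketed by soundex code into a Map. Plain Levenshtein distance normalized by the longer string's length stands in for "your usual metric", and CandidateMatcher and its methods are hypothetical names. Because buckets are independent of each other, a parallel stream can spread them over several threads, which also touches on the multithreading question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class CandidateMatcher {
    // Classic two-row dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;   // reuse the two rows
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]; 1.0 means identical strings.
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / longest;
    }

    // All-pairs comparison, but only inside one (small) bucket.
    static List<String[]> matchesIn(List<String> bucket, double threshold) {
        List<String[]> out = new ArrayList<>();
        for (int i = 0; i < bucket.size(); i++) {
            for (int j = i + 1; j < bucket.size(); j++) {
                if (similarity(bucket.get(i), bucket.get(j)) >= threshold) {
                    out.add(new String[] {bucket.get(i), bucket.get(j)});
                }
            }
        }
        return out;
    }

    // Buckets share no state, so they can be processed in parallel.
    static List<String[]> findMatches(Map<String, List<String>> buckets, double threshold) {
        return buckets.values().parallelStream()
                .flatMap(bucket -> matchesIn(bucket, threshold).stream())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Bucket keys are illustrative; in practice they come from the soundex step.
        Map<String, List<String>> buckets = Map.of(
                "J530", List.of("John Doe", "Jhon Doe", "Jane Dee"),
                "M660", List.of("Mary Roe"));
        for (String[] pair : findMatches(buckets, 0.7)) {
            System.out.println(pair[0] + " ~ " + pair[1]);
        }
    }
}
```

Note that with this particular metric "John Doe" and "Jhon Doe" score 0.75 (two substitutions over eight characters), so the threshold has to be tuned to whichever metric you actually use. The cost is now quadratic only in the size of each bucket rather than in the total number of names.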