The Archive Base

The Archive Base Latest Questions

Editorial Team
Asked: June 9, 2026 (2026-06-09T08:27:57+00:00)

EDIT Solution
this was the solution, thanks to @mprivat:

from mysql_wrapper2 import Mysql
import Levenshtein
import time

db = Mysql()
fields = [...]

# Concatenate the chosen fields into one string per record and let MySQL
# order the rows by the SOUNDEX code of that string, so that similar
# records end up next to each other.
concat_fields = ','.join(fields)
records = db.query('SELECT *, CONCAT(%s) as `string`, SOUNDEX(CONCAT(%s)) as `soundsLike` FROM records ORDER BY `soundsLike`' % (concat_fields, concat_fields))

print(len(records))

matches = []

start = time.time()
# Compare each record only with its immediate successor in the sorted order.
for k in range(len(records) - 1):
        if Levenshtein.ratio(records[k]['string'], records[k+1]['string']) >= .98:
                matches.append((records[k], records[k+1]))

print(time.time() - start)
print(len(matches))

End Solution

I have a MySQL database for a web application that may or may not have duplicate records in it. These records represent persons (name, address, and so on).

Currently, we have a Python script that pulls all records, compares them with shingles, and then presents a managing user with the potential duplicates. This script worked great with < 1000 records. Now that we are testing with realistic data (89k records), it has become unsustainable. I stopped the script at 16GB of memory usage, and I don’t know whether it was anywhere near finished. [Although I value our memory, speed is much more important, up to around 30GB of memory when processing 500k records.]

That method used pygraph for documenting the data, and shingles for finding duplicates. It was still stuck in the graphing — so I’m assuming that means it wasn’t even finished looking at the records!

We are looking to be able to compare “similar” matches, so differences in fields (like ‘St.’ and ‘Street’) don’t become grounds for uniqueness. We may not be dealing with simple synonyms, so just synonym replacement is not an option. So I guess we’re looking for fuzzy-matching.
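To make the goal concrete, here is a minimal sketch of fuzzy versus exact comparison. It uses `difflib.SequenceMatcher` from the Python standard library purely as a stand-in for whatever similarity measure ends up being used, and the sample addresses are made up for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a similarity score in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Made-up sample data, not taken from the real table.
a = "12 Main St."
b = "12 Main Street"
c = "98 Oak Avenue"

print(a == b)            # False: exact comparison treats them as different
print(similarity(a, b))  # high score for the 'St.' / 'Street' variants
print(similarity(a, c))  # much lower score for an unrelated address
```

A threshold on such a score is what lets ‘St.’ and ‘Street’ count as the same record without maintaining a synonym list.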

I’ve tried a few simple solutions, but they’re just not fast enough.

In the trivial example further down, the hash builds quickly (the current shingles implementation does not: it built for about 8 hours and wasn’t finished), but its speed at comparisons is horrendous (that’s algorithm inefficiency, not interpreter speed).

I’ve also been researching as many options as I possibly can. I’ve been stuck on Nearest Neighbor Search for a while, with its promise of DNlog(n) search.

There are two problems with the Nearest Neighbor Search, that you may very well know how to overcome, but I have not been able to:

  1. Not all keys are numerical
  2. A constraint of our current system is that we only want two matches in a set [K-Nearest Neighbor], and only if they’re a pretty clear match (Fixed-radius Nearest Neighbor)

I’ve also been looking into some kind of clustering solution. I’ve yet to find a working Python library that supports a variable number of clusters. Every solution I’ve found (and been unable to implement) requires knowing in advance how many clusters there will be in the end.

  1. We should only have matches if they’re actually similar
  2. We do not know how many there will be, if any.

For #2 we can certainly filter once we have an answer, but which would be better? But that’s only if #1 could be overcome.
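One possible way around the fixed-cluster-count requirement, offered as a sketch rather than a tested recommendation: treat every pair that passes the similarity threshold as an edge and take the connected components as clusters. The number of clusters then falls out of the data, and records that match nothing never appear. The function name, pair list, and record IDs below are hypothetical.

```python
def group_matches(pairs):
    """Group record IDs into clusters given a list of matching (id, id) pairs."""
    parent = {}

    def find(x):
        # Follow parent links to the cluster root, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for x in parent:
        clusters.setdefault(find(x), set()).add(x)
    return sorted(sorted(c) for c in clusters.values())

# Hypothetical output of a pairwise comparison step.
pairs = [(1, 2), (2, 3), (10, 11)]
print(group_matches(pairs))  # [[1, 2, 3], [10, 11]]
```

This only groups matches that some other step has already found; it does nothing to reduce the number of comparisons needed to find them.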

Given the constraints on the data we’re comparing, I have yet to be able to implement either an NN or a clustering solution, and had to move to plain old comparisons. In doing so, to reduce the number of comparison domains, I concatenated all of a record’s values and performed a Levenshtein comparison of each string against all other strings. This solution simply isn’t fast enough either. It doesn’t finish one single record in ~20 minutes (I stopped timing at that point).
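To put numbers on why the all-against-all comparison stalls, the short calculation below counts the unordered pairs for the record counts mentioned in this question. It assumes each pair is compared exactly once, which is already half the work of a naive double loop.

```python
def pair_count(n):
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2

for n in (1000, 89000, 500000):
    print(n, pair_count(n))
# 1000 499500
# 89000 3960455500
# 500000 124999750000
```

Going from 1,000 to 89,000 records multiplies the number of comparisons by roughly 8,000, which is why an approach that felt instant on test data does not finish on realistic data.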

What I came up with as a trivial example:

from mysql_wrapper2 import Mysql
import Levenshtein

db = Mysql()
# Get 89k records from the database
records = db.query("""
    SELECT *
    FROM records
""")

# Build a dictionary where the keys are our concatenated records, and their IDs are our data
# Only passing ID, we only use about 3GB of memory, instead of
#  > 16GB with the actual dictionaries

records_hash = {}
for i in records:
    key = []
    for k in i.values():
        key.append(str(k))
    # Store the ID once per record, after all of its values have been joined
    records_hash['-'.join(key)] = i['id']

# Compare every key to every key with the Levenshtein ratio

for i in records_hash.keys():
    print('once...')
    for j in records_hash.keys():
        if Levenshtein.ratio(i,j) >= .75:
            pass

Once again, this is a trivial example; if it were actually implemented, it would make sure that no two records were checked twice, and so on.
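A minimal sketch of the "never check a pair twice" refinement, using `itertools.combinations` so that each unordered pair is visited once and no key is compared with itself. `difflib.SequenceMatcher` stands in for `Levenshtein.ratio` so the snippet runs with the standard library alone (the two ratios are computed differently and will not give identical scores), and the sample keys are invented.

```python
from difflib import SequenceMatcher
from itertools import combinations

def find_similar(keys, threshold=0.75):
    """Return (key_a, key_b) pairs whose similarity ratio meets the threshold."""
    matches = []
    for a, b in combinations(keys, 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            matches.append((a, b))
    return matches

# Invented keys in the same 'value-value-value' shape as records_hash.
keys = [
    "john-smith-12 main st",
    "john-smith-12 main street",
    "mary-jones-98 oak ave",
]
print(find_similar(keys))
```

This halves the work compared with the nested loops, but the number of comparisons still grows quadratically with the number of records.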

Is there any solution? Is there a way I could implement either NN or clustering to fit my problem? Is there any other pattern I’m not considering that could help? Am I SOL?

1 Answer

  1. Editorial Team
    Added an answer on June 9, 2026 at 8:28 am

    I have not actually tried it, and you may find it doesn’t work, but I’ll share anyway because it’s something I would try if I were faced with this same problem.

    I would add a new column to the table. I would then go through all the records and do this:

    1. Concatenate the first name, last name, address, etc. into a single string, for which I would compute a Soundex-like code (possibly modified for your needs: for example, leaving digits in place if there are any, removing non-alphanumeric characters, adding support for long strings, etc.)

    2. Compute that Soundex-like code for every record and store it in the new column

    Then, assuming Soundex does its job, you could query the records sorted by the Soundex code and apply the comparison logic you already have. The idea is to compare only records that are near each other in the sorted result set.
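The sort-then-compare-neighbours idea can be sketched as follows. This is an illustration under assumptions, not code from the answer: `blocking_key` is a hypothetical stand-in for the Soundex column (it only lower-cases and strips non-alphanumeric characters), `difflib.SequenceMatcher` stands in for the comparison logic that already exists, and the `window` parameter widens "adjacent" to the next few rows in case one stray record sorts between two duplicates.

```python
from difflib import SequenceMatcher

def blocking_key(text):
    """Hypothetical sort key: lower-case letters and digits only."""
    return ''.join(ch for ch in text.lower() if ch.isalnum())

def sorted_neighbourhood(strings, window=2, threshold=0.9):
    """Compare each string only with the next `window - 1` strings in sort order."""
    ordered = sorted(strings, key=blocking_key)
    matches = []
    for i, current in enumerate(ordered):
        for other in ordered[i + 1:i + window]:
            score = SequenceMatcher(None, current.lower(), other.lower()).ratio()
            if score >= threshold:
                matches.append((current, other))
    return matches

# Invented rows; the two John Smith entries differ only in case and punctuation.
rows = ["Mary Jones 98 Oak Ave", "John Smith 12 Main St", "john smith 12 main st."]
print(sorted_neighbourhood(rows))
```

With a fixed window, the number of comparisons grows linearly with the number of records once the sort is done, at the cost of missing duplicates whose keys sort far apart.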

    By Soundex-like, I mean something different from the normal Soundex code, which is one letter followed by three digits. That won’t work well for long strings, so maybe use more than three digits.
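A rough sketch of what such a Soundex-like code might look like, assuming the modifications suggested in this answer: digits are kept, non-alphanumeric characters are dropped, and the code length is configurable. The function name and default length are hypothetical, and this is deliberately not standard Soundex (for instance, the special handling of h and w between consonants is omitted).

```python
# Standard Soundex letter groups; vowels and h, w, y carry no code.
CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for letter in letters:
        CODES[letter] = str(digit)

def soundex_like(text, length=12):
    """Simplified Soundex variant: keeps digits, ignores punctuation, longer code."""
    chars = [ch for ch in text.lower() if ch.isalnum()]
    if not chars:
        return ""
    out = [chars[0].upper()]           # first character is kept as-is
    previous = CODES.get(chars[0], "")
    for ch in chars[1:]:
        if ch.isdigit():
            out.append(ch)             # digits in the input pass through unchanged
            previous = ""
            continue
        code = CODES.get(ch, "")
        if code and code != previous:  # collapse runs of the same code
            out.append(code)
        previous = code
    return "".join(out)[:length].ljust(length, "0")

print(soundex_like("12 Main St."))
print(soundex_like("12 Main Street"))
```

The two sample addresses get codes that share a long common prefix, so they would sort next to each other even though the codes are not identical.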

    Anyway, just wanted to give you another avenue to explore…

© 2021 The Archive Base. All Rights Reserved