EDIT Solution
this was the solution, thanks to @mprivat:
from mysql_wrapper2 import Mysql
import Levenshtein
import time
db = Mysql()
fields = [...]
records = db.query('SELECT *, CONCAT(%s) as `string`, SOUNDEX(CONCAT(%s)) as `soundsLike` FROM records ORDER BY `soundsLike`' % (','.join(fields), ','.join(fields)))
print len(records)
matches = []
start = time.clock()
for k,v in enumerate(records):
if k+1 == len(records):
break
if Levenshtein.ratio(records[k]['string'],records[k+1]['string']) >= .98:
matches.append((records[k],records[k+1]))
print time.clock() - start
print len(matches)
End Solution
I have a MySQL database for a web application that may (or may not), have duplicate records in it. These records represent persons (i.e., name, address, etc. etc.).
Currently, we have a python script that pulls all records, compares them with shingles, and then presents a managing user with the potential duplicates. This script worked GREAT with < 1000 records. Now that we are testing with real~ish data (89k records), it’s become unsustainable. I stopped the script at 16GB of memory usage, and I don’t know if it was FINISHED yet! [Although I value our memory, speed is much more important, until around 30GB of memory — if processing 500k records]
That method used pygraph for documenting the data, and shingles for finding duplicates. It was still stuck in the graphing — so I’m assuming that means it wasn’t even finished looking at the records!
We are looking to be able to compare “similar” matches, so differences in fields (like ‘St.’ and ‘Street’) don’t become grounds for uniqueness. We may not be dealing with simple synonyms, so just synonym replacement is not an option. So I guess we’re looking for fuzzy-matching.
I’ve tried a few simple solutions, but they’re just not fast enough.
Although in this case the hash builds quickly (the current example with shingles does not… It built for about 8 hours and wasn’t finished), it’s speed at comparisons is horrendous (but that’s algorithm inefficiency, not interpreter speed)
I’ve also been researching as many options as I possibly can. I’ve been stuck with the Nearest Neighbor Search for a while, with hits promise of DNlog(n) search.
There are two problems with the Nearest Neighbor Search, that you may very well know how to overcome, but I have not been able to:
- Not all keys are numerical
- Constraint of our current system is that we only want two matches in a set [K-Nearest Neighbor], but only if they’re a pretty clear match (Fixed radius Nearest Neighbor)
I’ve also been lucking into some kind of clustering solution. I’ve yet to find a working python library, with a variable number of clusters. Every solution I’ve found, and have been unable to implement, requires advance knowledge of the numbers of clusters to have in the end.
- We should only have matches if they’re actually similar
- We do not know how many there will be, if any.
For #2 we can certainly filter once we have an answer, but which would be better? But that’s only if #1 could be overcome.
Since the constraints on the data we’re comparing, I have yet to be able to impliment either a NN or clustering solution, and had to move to plain old comparisons. In doing so, to remove the number of comparison domains, I concatenated all of a records’ values, and performed a Levenshtein comparison on the strings to all other strings. This solution simply isn’t fast enough either. It doesn’t finish one single record in ~20 minutes (I stopped timing at that point).
What I came up with as a trivial example:
from mysql_wrapper2 import Mysql
import Levenshtein
db = Mysql()
# Get 89k records from the database
records = db.query(""""
SELECT *
FROM records
""")
# Build a dictionary where the keys are our records, and their IDs are our data
# Only passing ID, we only use about 3GB of memory, instead of
# > 16GB with the actual dictionaries
records_hash = {}
for i in records:
key = []
for k in i.values():
key.append(str(i))
records_hash['-'.join(key)] = i['id']
# Compare every key to every key with the Levenshtein ratio
for i in records_hash.keys():
print 'once...'
for j in records_hash.keys():
if Levenshtein.ratio(i,j) >= .75:
pass
Once again, this is a trivial example, and if were actually implemented, would make sure that no two records were checked twice, etc. etc.
Is there any solution? Is there a way I could impliment either NN or Clustering to fit my problem? Is there any other pattern I’m not considering that could help? Am I SOL?
I have not actually tried it, and you may find it doesn’t work, but I’ll share anyway because it’s something I would try if I was faced with this same problem.
I would add a new column to the table. I would then go through all the records and do this:
concatenate the first name, last name, address, etc… into a string for which I would compute a Soundex-like (or slightly modified for your needs, for example leave digits in place if any, remove non-alphanumeric characters, adding support for long strings, etc…)
I would then compute the soundex code and store it in my new column
Then, assume soundex does its job, you could query the records, sorting by soundex, and then applying the comparison logic you already have. The idea is to compare only records that are nearby each other in the sorted result set.
By soundex-like, I mean different from the normal soundex which is one letter and three numbers. But that won’t work well for long strings. May be use more than three numbers.
Anyway, just wanted to give you another avenue to explore…