I have some duplicated entities in my App Engine datastore (not whole rows, but most of their fields are the same).
What’s the best way to find them?
I’ve both integer and string fields that are duplicated (in case comparing one is faster than the other).
Thanks!
A crude but quick approach would be to take the fields you care about, concatenate them into one long string, and store that string as the key of a DB_Unique entity that references the original entity. Each time you call DB_Unique.get_or_insert(), verify that the reference points to the entity you are currently processing; if it points to a different one, you have a duplicate. This should probably be done in a MapReduce job.
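The idea above might look roughly like the following. This is a minimal in-memory sketch, not App Engine code: a plain dict stands in for the DB_Unique kind, and dict.setdefault stands in for get_or_insert(). The function name find_duplicates, the entities mapping, and the field names b and c are all hypothetical.

```python
def find_duplicates(entities, fields, sep="_"):
    """Return (duplicate_id, original_id) pairs.

    `entities` maps an entity id to a dict of field values. The `unique`
    dict is a stand-in for the DB_Unique kind: its keys are the
    concatenated field values and its values reference the first entity
    seen with those values.
    """
    unique = {}
    duplicates = []
    for entity_id, entity in entities.items():
        key_name = sep.join(str(entity[f]) for f in fields)
        # setdefault mimics get_or_insert: it returns the stored reference
        # if the key already exists, otherwise it stores ours and returns it.
        original_id = unique.setdefault(key_name, entity_id)
        if original_id != entity_id:
            duplicates.append((entity_id, original_id))
    return duplicates


# Hypothetical sample data: entities 1 and 2 share the same b and c.
sample = {
    1: {"a": 10, "b": "foo", "c": "bar"},
    2: {"a": 20, "b": "foo", "c": "bar"},
    3: {"a": 30, "b": "foo", "c": "baz"},
}
print(find_duplicates(sample, ["b", "c"]))
```

In a real datastore job the loop body would run once per entity inside the mapper, with the dict lookup replaced by the get_or_insert() call.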
There are a couple of edge cases that might creep in. For example, if two entities have fields b and c equal to ('a_b', '') and ('a', 'b_') respectively, the concatenation is 'a_b_' for both, even though the entities differ. So either use a separator character you know never appears in your strings instead of '_', or make DB_Unique.r a list of references and compare the actual field values of all of them.
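To make the collision and the list-of-references fix concrete, here is a small sketch under the same assumptions as before: a plain dict stands in for DB_Unique, each dict value plays the role of the list DB_Unique.r, and the names make_key and find_duplicates_checked are hypothetical.

```python
def make_key(values, sep="_"):
    """Concatenate field values into one key string (may collide)."""
    return sep.join(str(v) for v in values)


def find_duplicates_checked(entities, fields, sep="_"):
    """Like the naive approach, but compares real field values on a key hit."""
    unique = {}  # key string -> list of entity ids (stand-in for DB_Unique.r)
    duplicates = []
    for entity_id, entity in entities.items():
        values = tuple(entity[f] for f in fields)
        refs = unique.setdefault(make_key(values, sep), [])
        for other_id in refs:
            other_values = tuple(entities[other_id][f] for f in fields)
            # Only a match on the actual values counts as a duplicate;
            # a shared key string alone may just be a separator collision.
            if other_values == values:
                duplicates.append((entity_id, other_id))
                break
        refs.append(entity_id)
    return duplicates


# Entities 1 and 2 collide on the key string but are not duplicates;
# entity 3 is a true duplicate of entity 1.
colliding = {
    1: {"b": "a_b", "c": ""},
    2: {"b": "a", "c": "b_"},
    3: {"b": "a_b", "c": ""},
}
print(make_key(("a_b", "")) == make_key(("a", "b_")))
print(find_duplicates_checked(colliding, ["b", "c"]))
```

The extra comparison costs one fetch per stored reference, so it only matters when key collisions are actually possible in your data.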