Say I have a bunch of data on some people. This could include Name, DOB, Address, Email, etc… Assume there are no unique identifiers (like an id column) on this data, but also assume that there are no repeating rows. I need to figure out the minimum set of fields I can use to query that data and return a unique row.
An example of a solution would be: “I can make a query that specifies a first name, dob, email, and zip, and that would return exactly one or zero rows.”
Did I ask that in a way that makes sense? I am looking for a technique, algorithm, or software package that would solve this problem for a given set of data. Anything that could provide an answer would work. Thanks!
EXAMPLE DATA (the real stuff is much more complex):
FNAME    LNAME    DOB     ZIP    EMAIL
John     Smith    1/1/12  77777  dude@fake.com
Sean     Smith    1/2/08  77777  dude@fake.com
Sean     William  4/2/07  77789  stuff@fake.com
Richard  Ross     1/1/12  78989  foo@fake.com
A solution for this set of data would be (FNAME, LNAME) or (EMAIL, DOB) or (EMAIL, FNAME).
I think you will need an iterative approach.
Begin with each column on its own and attempt to create a unique index on it.
If that succeeds for some column, you are done.
If no single column works, add another column and try again with every combination of two columns.
Keep increasing the number of columns until an index can be created successfully; the first combination that works is a minimal set.
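
The iterative approach above can also be tried outside the database. Here is a minimal Python sketch, assuming the data fits in memory as a list of dicts. The function name `minimal_unique_column_sets` and the variable names are made up for illustration. Instead of creating a unique index, it checks whether the rows are still all distinct once they are reduced to the candidate columns.

```python
from itertools import combinations


def minimal_unique_column_sets(rows, columns):
    """Return every smallest combination of columns that identifies each row uniquely."""
    for size in range(1, len(columns) + 1):
        found = []
        for combo in combinations(columns, size):
            # Reduce each row to the candidate columns. The combination works
            # as a key only if no reduced row appears more than once.
            projected = {tuple(row[c] for c in combo) for row in rows}
            if len(projected) == len(rows):
                found.append(combo)
        if found:
            # Stop at the first size that works, so the results are minimal.
            return found
    return []  # reached only when some rows are complete duplicates


# The example data from the question.
columns = ["FNAME", "LNAME", "DOB", "ZIP", "EMAIL"]
data = [
    ("John", "Smith", "1/1/12", "77777", "dude@fake.com"),
    ("Sean", "Smith", "1/2/08", "77777", "dude@fake.com"),
    ("Sean", "William", "4/2/07", "77789", "stuff@fake.com"),
    ("Richard", "Ross", "1/1/12", "78989", "foo@fake.com"),
]
rows = [dict(zip(columns, r)) for r in data]

result = minimal_unique_column_sets(rows, columns)
for combo in result:
    print(combo)
```

On the example data no single column is unique, so the search stops at pairs and prints every pair that works, including (FNAME, LNAME) and (DOB, EMAIL). Keep in mind that the number of combinations grows quickly with the number of columns, so on wide tables it may help to drop obviously useless columns first. Also, a set that is unique for the current data is not guaranteed to stay unique when new rows arrive.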