I have a large (more than 100K objects) collection of Java objects like below.
public class User
{
//declared as public in this example for brevity...
public String first_name;
public String last_name;
public String ssn;
public String email;
public String blog_url;
...
}
Now, I need to search this list for an object where at least 3 (any 3 or more) attributes match those of the object being searched.
For example, if I am searching for an object that has
first_name="John",
last_name="Gault",
ssn="000-00-0000",
email="xyz@abc.com",
blog_url="http://myblog.wordpress.com"
The search should return all objects where first_name, last_name and ssn match, as well as those where last_name, ssn, email and blog_url match. Other combinations are possible too.
I would like to know what’s the best data-structure/algorithm to use in this case. For an exact search, I could have used a hashset or binary search with a custom comparator, but I am not sure what’s the most efficient way to perform this type of search.
P.S.
- This is not a homework exercise.
- I am not sure if the question title is appropriate. Please feel free to edit.
EDIT
Some of you have pointed out that I could use ssn (for example) for the search, as it is more or less unique. The example above is only illustrative of the real scenario. In reality, I have several objects where some of the fields are null, so I would like to search on other fields.
I don’t think that there are any specific data structures to make this kind of matching / comparison fast.
At the simple level of comparing two objects, you might implement a method like this:
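A minimal sketch of such a method, assuming the five fields shown in the question (the elided fields are left out) and assuming that a null field never counts as a match. The class name UserMatcher, the helper fieldMatches and the constant MIN_MATCHES are hypothetical names chosen for illustration:

```java
// Stand-in for the question's User class; only the five listed fields are modelled.
class User {
    String first_name, last_name, ssn, email, blog_url;

    User(String first_name, String last_name, String ssn, String email, String blog_url) {
        this.first_name = first_name;
        this.last_name = last_name;
        this.ssn = ssn;
        this.email = email;
        this.blog_url = blog_url;
    }
}

final class UserMatcher {
    // Assumption: "close enough" means at least 3 equal, non-null fields.
    static final int MIN_MATCHES = 3;

    // A null on either side is treated as "no match" (an assumption, since
    // the question says some fields may be null).
    static boolean fieldMatches(String a, String b) {
        return a != null && a.equals(b);
    }

    static boolean closeEnough(User a, User b) {
        int matches = 0;
        if (fieldMatches(a.first_name, b.first_name)) matches++;
        if (fieldMatches(a.last_name, b.last_name)) matches++;
        if (fieldMatches(a.ssn, b.ssn)) matches++;
        if (fieldMatches(a.email, b.email)) matches++;
        if (fieldMatches(a.blog_url, b.blog_url)) matches++;
        return matches >= MIN_MATCHES;
    }
}
```

A linear scan then simply calls UserMatcher.closeEnough(probe, candidate) for every object in the collection.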
To do a large-scale search, the only way I can think of that would improve on a simple linear scan (using the method above) would be to index the properties, for example with one hash map per property.
Then each time you want to do a query, look up the query object's property values in the indexes to collect candidate objects, and run closeEnough() on the candidates to find the matches.
You could improve on this by treating the SSN, email address and blog URL properties differently to the name properties. Multiple users with matches on the first three properties should be a rare occurrence, compared with (say) finding multiple users called “John”. The way that you have posed the question requires at least 1 of SSN, email or URL to match (to get 3 matches), so maybe you could not bother indexing the name properties at all.
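The indexing approach, including the refinement of indexing only the SSN, email and blog URL properties, might be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the class UserIndex and its search method are hypothetical names, the nested User class is a compact stand-in for the question's class, null fields never match, and skipping the name indexes is only safe while there are exactly five compared fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

final class UserIndex {
    // Compact stand-in for the question's User class, repeated so the sketch compiles on its own.
    static final class User {
        final String first_name, last_name, ssn, email, blog_url;

        User(String first_name, String last_name, String ssn, String email, String blog_url) {
            this.first_name = first_name;
            this.last_name = last_name;
            this.ssn = ssn;
            this.email = email;
            this.blog_url = blog_url;
        }
    }

    // Every field that takes part in the "at least 3 matches" comparison.
    private static final List<Function<User, String>> ALL_FIELDS = Arrays.asList(
            u -> u.first_name, u -> u.last_name, u -> u.ssn, u -> u.email, u -> u.blog_url);

    // Only the selective fields get an index; the name fields are skipped.
    private static final List<Function<User, String>> INDEXED_FIELDS = Arrays.asList(
            u -> u.ssn, u -> u.email, u -> u.blog_url);

    // One hash map per indexed field: field value -> users having that value.
    private final List<Map<String, List<User>>> indexes = new ArrayList<>();

    UserIndex(Iterable<User> users) {
        for (int i = 0; i < INDEXED_FIELDS.size(); i++) {
            indexes.add(new HashMap<>());
        }
        for (User u : users) {
            for (int i = 0; i < INDEXED_FIELDS.size(); i++) {
                String value = INDEXED_FIELDS.get(i).apply(u);
                if (value != null) { // null values are not indexed
                    indexes.get(i).computeIfAbsent(value, k -> new ArrayList<>()).add(u);
                }
            }
        }
    }

    // True when at least 3 non-null fields are equal.
    static boolean closeEnough(User a, User b) {
        int matches = 0;
        for (Function<User, String> field : ALL_FIELDS) {
            String value = field.apply(a);
            if (value != null && value.equals(field.apply(b))) {
                matches++;
            }
        }
        return matches >= 3;
    }

    // Collect candidates that share an indexed value with the probe, then filter them.
    List<User> search(User probe) {
        Set<User> candidates = new LinkedHashSet<>(); // de-duplicates by object identity
        for (int i = 0; i < INDEXED_FIELDS.size(); i++) {
            String value = INDEXED_FIELDS.get(i).apply(probe);
            if (value != null) {
                candidates.addAll(indexes.get(i).getOrDefault(value, Collections.emptyList()));
            }
        }
        List<User> result = new ArrayList<>();
        for (User candidate : candidates) {
            if (closeEnough(probe, candidate)) {
                result.add(candidate);
            }
        }
        return result;
    }
}
```

Building the index costs one pass over the collection. Each query then touches only the objects that share an SSN, email or blog URL with the probe, which should be a small set compared with the 100K+ objects a linear scan would visit.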