Is there a simple algorithm to estimate the likelihood that two names represent the same person?
I’m not asking for something at the level a customs department might use. Just a simple algorithm that would tell me whether ‘James T. Clark’ is more likely the same name as ‘J. Thomas Clark’ or as ‘James Clerk’.
An algorithm in C# would be great, but I can translate from any language.
I faced a similar problem and tried Levenshtein distance first, but it did not work well for me. I came up with an algorithm that gives a ‘similarity’ value between two strings (a higher value means more similar strings, and identical strings score ‘1’). This value is not very meaningful by itself (if it is not ‘1’, it is always 0.5 or less), but it works quite well when you feed it into the Hungarian algorithm to find matching pairs between two lists of strings.
Use like this:
The code behind:
The Levenshtein distance version is much simpler (adapted from http://www.merriampark.com/ld.htm):
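For reference, here is a minimal Python sketch of the standard Levenshtein distance. It is the textbook dynamic-programming version, not the adapted listing itself. The `similarity` normalization and the `best_pairing` helper are hypothetical names added for illustration: `best_pairing` brute-forces the best assignment between two short lists, standing in for the Hungarian algorithm, which solves the same assignment problem much faster on larger lists.

```python
from itertools import permutations


def levenshtein(a: str, b: str) -> int:
    """Textbook edit distance: minimum inserts, deletes and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # distance is symmetric; keep the stored row short
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalization: 1.0 for identical strings, lower with more edits."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def best_pairing(left, right):
    """Pair every name in `left` with a distinct name in `right`.

    Brute force over all assignments (assumption: lists are short).
    It maximizes the summed similarity, which is the problem the
    Hungarian algorithm solves in polynomial time.
    """
    if len(left) > len(right):
        raise ValueError("left must not be longer than right")
    best = max(
        permutations(range(len(right)), len(left)),
        key=lambda perm: sum(similarity(l, right[j]) for l, j in zip(left, perm)),
    )
    return [(l, right[j]) for l, j in zip(left, best)]


if __name__ == "__main__":
    print(levenshtein("James T. Clark", "James Clerk"))
    print(levenshtein("James T. Clark", "J. Thomas Clark"))
    print(best_pairing(["James Clark", "Ann Lee"], ["Anne Lee", "James Clerk"]))
```

On the names from the question, plain edit distance puts ‘James Clerk’ closer to ‘James T. Clark’ than ‘J. Thomas Clark’ is, because initials and reordered name parts cost many edits. That is one way to see why raw Levenshtein distance can disappoint for person names.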