I have a database table containing rows that are near-duplicates of each other: their column values are almost, but not exactly, equal. I need to merge each group of such rows into a single row.
For example, these two users (u1 and u2) should be merged:
u1 = User(name = "William Henry Gates III",
age = 55,
nationality = "american",
alma_mater = "Harvard Univesity")
u2 = User(name: "William Henry 'Bill' Gates III",
age: 55,
nationality: "America",
alma_mater: "Harvard U.")
I am thinking of using edit distance and stemming techniques. Can you suggest other algorithms or techniques? Are there any helpful libraries (preferably in Python or Java)?
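To make the idea concrete, here is a minimal sketch of field-by-field string similarity in Python. It uses only the standard library's difflib.SequenceMatcher, whose ratio() is a similarity score between 0 and 1 rather than a true Levenshtein edit distance, so it is a stand-in for the technique and not the technique itself. The helper names (normalize, field_similarity, record_similarity), the plain dicts in place of User objects, the extra record u3, and the 0.7 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(value):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(str(value).lower().split())


def field_similarity(a, b):
    """Similarity in [0, 1] between two field values (1.0 means identical)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def record_similarity(r1, r2, fields):
    """Unweighted average of the per-field similarities."""
    scores = [field_similarity(r1[f], r2[f]) for f in fields]
    return sum(scores) / len(scores)


# The two users from the question, as plain dicts (an assumption; the
# original uses User objects).
u1 = {"name": "William Henry Gates III", "age": 55,
      "nationality": "american", "alma_mater": "Harvard Univesity"}
u2 = {"name": "William Henry 'Bill' Gates III", "age": 55,
      "nationality": "America", "alma_mater": "Harvard U."}
# A hypothetical unrelated record, added only for contrast.
u3 = {"name": "Linus Torvalds", "age": 41,
      "nationality": "finnish", "alma_mater": "University of Helsinki"}

FIELDS = ["name", "age", "nationality", "alma_mater"]
THRESHOLD = 0.7  # hypothetical cut-off; would need tuning on real data

for label, other in (("u2", u2), ("u3", u3)):
    score = record_similarity(u1, other, FIELDS)
    verdict = "merge candidates" if score > THRESHOLD else "distinct"
    print(f"u1 vs {label}: {score:.2f} ({verdict})")
```

Comparing every pair of rows this way is quadratic in the table size, so on a large table I would presumably first have to group rows by some cheap key (for example the first letters of the name) and only compare within groups.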
Have you considered something like Refine?