I have two DataFrames which I want to merge based on a column. However, due to alternate spellings, different number of spaces, absence/presence of diacritical marks, I would like to be able to merge as long as they are similar to one another.
Any similarity algorithm will do (soundex, Levenshtein, difflib’s).
Say one DataFrame has the following data:
df1 = DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
number
one 1
two 2
three 3
four 4
five 5
df2 = DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])
letter
one a
too b
three c
fours d
five e
Then I want to get the resulting DataFrame
number letter
one 1 a
two 2 b
three 3 c
four 4 d
five 5 e
Similar to @locojay suggestion, you can apply
difflib‘sget_close_matchestodf2‘s index and then apply ajoin:.
If these were columns, in the same vein you could apply to the column then
merge: