I want to find string similarity between two strings. en.wikipedia has examples of some of them. code.google has a Python implementation of Levenshtein distance.
Is there a better algorithm, (and hopefully a Python library), under these constraints:
- I want to do fuzzy matches between strings. eg matches(‘Hello, All you people’, ‘hello, all You peopl’) should return True
- False negatives are acceptable, False positives, except in extremely rare cases are not.
- This is done in a non realtime setting, so speed is not (much) of concern.
- [Edit] I am comparing multi word strings.
Would something other than Levenshtein distance(or Levenshtein ratio) be a better algorithm for my case?
There’s a great resource for string similarity metrics at the University of Sheffield. It has a list of various metrics (beyond just Levenshtein) and has open-source implementations of them. Looks like many of them should be easy to adapt into Python.
http://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
Here’s a bit of the list: