I need to run OCR on a large chunk of text and check whether it contains a certain string. Because the OCR is inaccurate, I need to check whether the text contains something like an ~85% match for that string.
For example, I may OCR a chunk of text to make sure it doesn’t contain "no information available", but the OCR might read it as "n0 inf0rmation available" or misinterpret any number of characters.
Is there an easy way to do this in Python?
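As posted by gauden, `SequenceMatcher` in `difflib` is an easy way to go. Its `ratio()` method returns a value between 0 and 1 corresponding to the similarity between the two strings. According to the docs, this is computed as 2.0*M / T, where T is the total number of elements in both sequences and M is the number of matches.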
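A minimal sketch of the `ratio()` check, using the strings from the question. The helper name `similarity` and the 0.85 threshold are illustrative choices, not part of `difflib`:

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity score in [0, 1] for two strings."""
    # The first argument is the optional "junk" filter; None disables it.
    return SequenceMatcher(None, a, b).ratio()


expected = "no information available"
ocr_text = "n0 inf0rmation available"

score = similarity(expected, ocr_text)
print(score)          # similarity as a float
print(score >= 0.85)  # True if it clears the assumed 85% threshold
```

Note that the comparison is case-sensitive, so you may want to lowercase both strings first if the OCR output varies in case.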
There is also `get_close_matches`, which might be useful to you. You can specify a similarity cutoff (a ratio between 0 and 1, default 0.6), and it’ll return the best matches from a list that score at least that cutoff (up to `n` of them, default 3).

Update: to find a partial sub-sequence match
To find close matches to a three-word sequence, I would split the text into words, group them into overlapping three-word sequences, and then apply `difflib.get_close_matches` to that list.
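The approach described above could be sketched roughly as follows. The function name `find_fuzzy_phrase`, the sample text, and the 0.85 cutoff are assumptions for illustration; only `difflib.get_close_matches` comes from the standard library:

```python
import difflib


def find_fuzzy_phrase(text, phrase, cutoff=0.85):
    """Return word windows in `text` that approximately match `phrase`."""
    words = text.split()
    size = len(phrase.split())
    # Build every run of `size` consecutive words (a sliding window).
    windows = [
        " ".join(words[i:i + size])
        for i in range(len(words) - size + 1)
    ]
    # Keep only the single best window scoring at least `cutoff`.
    return difflib.get_close_matches(phrase, windows, n=1, cutoff=cutoff)


ocr_text = "Scan result: n0 inf0rmation available for this record"
matches = find_fuzzy_phrase(ocr_text, "no information available")
print(matches)
print(bool(matches))  # True if the phrase was found approximately
```

This assumes the OCR keeps word boundaries intact. If it tends to drop or insert spaces, sliding a character window of roughly the phrase's length over the text may work better than splitting on whitespace.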