I need to do some OCR on a large chunk of text and check

Question

Asked: June 5, 2026 (2026-06-05T04:23:40+00:00)

I need to do some OCR on a large chunk of text and check whether it contains a certain string. Because the OCR is inaccurate, the check needs to accept something like an ~85% match for the string.

For example, I may OCR a chunk of text to make sure it doesn't contain "no information available", but the OCR might see "n0 inf0rmation available" or misinterpret any number of characters.

Is there an easy way to do this in Python?

1 Answer

Editorial Team · Answer 1 · 2026-06-05T04:23:42+00:00

As posted by gauden, SequenceMatcher in difflib is an easy way to go. Its ratio() method returns a value between 0 and 1 that measures the similarity between the two strings. From the docs:

Where T is the total number of elements in both sequences, and M is
the number of matches, this is 2.0*M / T. Note that this is 1.0 if the
sequences are identical, and 0.0 if they have nothing in common.

Example (output as printed by Python 3):

>>> import difflib
>>> difflib.SequenceMatcher(None, 'no information available', 'n0 inf0rmation available').ratio()
0.9166666666666666
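
To see where that number comes from, the formula quoted from the docs can be checked by hand. This is only an illustrative sketch: the variable names (matches, total, manual_ratio) are made up for the example, and it assumes that M can be obtained by summing the block sizes that get_matching_blocks() reports.

```python
import difflib

a = 'no information available'
b = 'n0 inf0rmation available'
matcher = difflib.SequenceMatcher(None, a, b)

# M: number of matched characters. The final block returned by
# get_matching_blocks() is a zero-length sentinel, so it adds nothing.
matches = sum(block.size for block in matcher.get_matching_blocks())

# T: total number of characters in both strings.
total = len(a) + len(b)

manual_ratio = 2.0 * matches / total
print(matches, total, manual_ratio)

# The ~85% requirement from the question is then a simple comparison.
print(manual_ratio >= 0.85)
```

Here only the two "o" characters that OCR turned into "0" fail to match, so 22 of the 24 characters on each side line up and the ratio is 2 * 22 / 48.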

There is also get_close_matches, which might be useful to you. You can specify a similarity cutoff (a ratio between 0 and 1), and it will return the best matches from a list (up to three by default) that score at least that high:

>>> difflib.get_close_matches('unicorn', ['unicycle', 'uncorn', 'corny',
...                           'house'], cutoff=0.8)
['uncorn']
>>> difflib.get_close_matches('unicorn', ['unicycle', 'uncorn', 'corny',
...                           'house'], cutoff=0.5)
['uncorn', 'corny', 'unicycle']
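
If it helps to see why each cutoff keeps the candidates it does, the individual scores can be printed next to the words. This is a rough sketch rather than how get_close_matches works internally; the names candidates, scored and above_cutoff are invented for the illustration.

```python
import difflib

target = 'unicorn'
candidates = ['unicycle', 'uncorn', 'corny', 'house']

# Score every candidate against the target, best first.
scored = sorted(
    ((difflib.SequenceMatcher(None, target, c).ratio(), c) for c in candidates),
    reverse=True,
)
for ratio, word in scored:
    print(f'{word:10s} {ratio:.3f}')

# Keep only the candidates at or above a chosen cutoff.
above_cutoff = [word for ratio, word in scored if ratio >= 0.8]
print(above_cutoff)
```

Only 'uncorn' scores above 0.8, while 'corny' and 'unicycle' land between 0.5 and 0.8, which is why lowering the cutoff to 0.5 lets them through.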

Update: to find a partial sub-sequence match

To find close matches to a three-word sequence, I would split the text into words, group them into overlapping three-word sequences, and then apply difflib.get_close_matches, like this:

import difflib

# Adjacent string literals are concatenated, so the text can span two lines.
text = ("Here is the text we are trying to match across to find the three word "
        "sequence n0 inf0rmation available I wonder if we will find it?")
words = text.split()
# Build every run of three consecutive words.
three = [' '.join([i, j, k]) for i, j, k in zip(words, words[1:], words[2:])]
print(difflib.get_close_matches('no information available', three, cutoff=0.9))
# Output:
# ['n0 inf0rmation available']
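
The same idea can be generalised to a phrase of any length and to the ~85% threshold from the question. The following is a minimal sketch under a few assumptions: fuzzy_contains is a hypothetical helper name, the window size is taken from the number of words in the phrase, and the OCR is assumed not to merge or split words (a word-based window would miss those cases).

```python
import difflib

def fuzzy_contains(text, phrase, threshold=0.85):
    """Return (best_window, ratio) if some run of words in text matches
    phrase with a ratio of at least threshold, otherwise None.

    Hypothetical helper for illustration only.
    """
    words = text.split()
    n = len(phrase.split())
    best_window, best_ratio = None, 0.0
    # Slide a window of n words across the text and score each one.
    for start in range(len(words) - n + 1):
        window = ' '.join(words[start:start + n])
        ratio = difflib.SequenceMatcher(None, phrase, window).ratio()
        if ratio > best_ratio:
            best_window, best_ratio = window, ratio
    if best_ratio >= threshold:
        return best_window, best_ratio
    return None

ocr_text = ("Here is the text we are trying to match across to find the three word "
            "sequence n0 inf0rmation available I wonder if we will find it?")
print(fuzzy_contains(ocr_text, 'no information available'))
print(fuzzy_contains('Here is some clean text with nothing suspicious in it',
                     'no information available'))
```

Scoring every window with SequenceMatcher directly, instead of calling get_close_matches, makes the ratio of the best window available to the caller, which can be handy when tuning the threshold for a particular OCR engine.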
