Tesseract OCR engine sometimes outputs text that has no meaning, i want to design


Asked: June 2, 2026 (2026-06-02T07:06:52+00:00)


The Tesseract OCR engine sometimes outputs text that has no meaning. I want to design an algorithm that discards any text or word that has no meaning. Below is a sample of the output I want to discard. My simple solution is to count the words in the recognized text, splitting on " ", and to treat text with too many words as garbage (hint: the images I'm scanning contain at most 40 words). Any ideas would be helpful, thanks.

 wo:>"|axnoA1wvw\
 ldﬂﬁg
 °J!9O‘ !P99W M9N 6 13!-|15!Cl ‘I-/Vl
 978 89l9 Z0 3+ 3 'l9.l.
 97 999 VLL lLOZ+ 3 9l!q°lN
 wo0'|axno/(@|au1e>1e: new;
 1=96r2a1ey\1 1uauud0|e/\e(]
 |8UJB){ p8UJL|\7'


1 Answer

Editorial Team · Answer 1 · 2026-06-02T07:06:54+00:00

Divide the output text into words, then divide each word into triples (overlapping runs of three consecutive letters). Count the triple frequencies and compare them to the triple frequencies from a known-good text corpus (e.g. all the articles from some mailing list discussing what you intend to OCR, minus the header lines).

When I say “triples”, I mean:

whe, hen, i, say, tri, rip, ipl, ple, les, i, mea, ean

…so “i” has a frequency of 2 in this short example, while the others all have a frequency of 1. Words of three letters or fewer, such as “i” and “say”, are kept whole.
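As a rough illustration, the splitting step might look like the Python sketch below. The function name `word_triples` and the choice to treat only the letters a-z as word characters are assumptions made for this example, not something the approach requires.

```python
import re
from collections import Counter

def word_triples(text):
    """Split text into lowercase words, then each word into overlapping
    three-letter chunks. Words of three letters or fewer are kept whole."""
    triples = []
    # Assumption: a "word" is a run of the letters a-z; everything else separates words.
    for word in re.findall(r"[a-z]+", text.lower()):
        if len(word) <= 3:
            triples.append(word)
        else:
            triples.extend(word[i:i + 3] for i in range(len(word) - 2))
    return triples

# Frequency count for the example sentence used in this answer.
counts = Counter(word_triples("When I say “triples”, I mean"))
print(counts["i"], counts["whe"])
```

Swapping the regular expression for one that covers your target alphabet should be enough to adapt this to other languages.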

If you count the frequency of each of these triples in a large document in your intended language, you can guess with reasonable accuracy whether a given string is in the same language.
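One possible way to turn those counts into a yes/no guess is sketched below in Python. It is only a sketch under some assumptions: the names `triples`, `train`, and `score` are made up for this example, the average log-probability with add-one smoothing is one scoring choice among several, and the tiny inline corpus stands in for the large known-good corpus you would really use.

```python
import math
import re
from collections import Counter

def triples(text):
    """Lowercase words split into overlapping three-letter chunks;
    words of three letters or fewer are kept whole."""
    out = []
    for word in re.findall(r"[a-z]+", text.lower()):
        if len(word) <= 3:
            out.append(word)
        else:
            out.extend(word[i:i + 3] for i in range(len(word) - 2))
    return out

def train(corpus_text):
    """Count triple frequencies in known-good text."""
    return Counter(triples(corpus_text))

def score(text, model):
    """Average log-probability per triple, with add-one smoothing so unseen
    triples are penalised rather than rejected outright. Higher means the
    text looks more like the corpus."""
    found = triples(text)
    if not found:
        return float("-inf")  # no letters at all: treat as garbage
    total = sum(model.values())
    vocab = len(model) + 1    # +1 leaves room for "unseen"
    return sum(math.log((model[t] + 1) / (total + vocab)) for t in found) / len(found)

# Toy corpus for illustration only; a real one should be far larger.
model = train("the quick brown fox jumps over the lazy dog "
              "and then the dog chases the fox over the hill")
good = score("the dog jumps over the fox", model)
bad = score('wo:>"|axnoA1wvw', model)
print(good > bad)
```

Where to put the cut-off between "text" and "garbage" is something you would have to tune by scoring samples of both kinds from your own scans.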

Granted, it’s heuristic.

I’ve used a similar approach for detecting English passwords in a password-changing program. It worked pretty well, though there’s no such thing as a perfect “obvious password rejecter”.
