I am working on a Python/Django web application and I need to extract text

Question

Asked: May 24, 2026 (2026-05-24T16:50:55+00:00)

I am working on a Python/Django web application and I need to extract text from scanned documents (for search indexing).

What options are there for OCR engines? I know of tesseract, but I am not entirely satisfied with the results. The problem could perhaps be solved by more extensive pre-processing (rotation, level adjustment, etc.).

Requirements:

Should not require manual tuning (other than initial tuning)
Preferably open source; alternatively, it should be possible to buy a “liberal” license
Either a Python module or a command-line program (or a C library that I can turn into a command-line program 🙂)

Alternatively:

A good library that does image pre-processing so that an existing engine like tesseract will perform better.

1 Answer

Editorial Team · Answer 1 · 2026-05-24T16:50:56+00:00

Tesseract can optionally be compiled with Leptonica, a library with a fairly exhaustive set of image manipulation functions (I’m not sure whether Tesseract itself uses it for anything beyond supporting image formats other than basic TIFF). A thorough list of features can be found on the Leptonica website. The project's author, Dan Bloomberg, has written a few papers on image preprocessing for OCR that may also be of interest to you; you can find them with a Google search restricted to site: http://www.leptonica.com/papers/
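To give a rough idea of what "pre-processing before OCR" can look like from Python, here is a minimal sketch of one common step, binarization with Otsu's threshold, followed by building a Tesseract command line. It does not use Leptonica, and Otsu's method is just one illustrative choice, not something the papers above prescribe. It assumes the scan has already been decoded into a flat list of 8-bit grayscale values (for example with an imaging library, not shown), and the function names `otsu_threshold`, `binarize`, and `tesseract_command` are hypothetical helpers invented for this example.

```python
import shutil
import subprocess


def otsu_threshold(pixels):
    """Return the 8-bit threshold that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of gray levels of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no background pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is background from here on
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance, up to a constant factor.
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to pure black (0) or pure white (255)."""
    return [255 if p > threshold else 0 for p in pixels]


def tesseract_command(image_path, lang="eng"):
    """Build a command that prints the recognized text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def run_ocr(image_path, lang="eng"):
    """Run Tesseract on an already pre-processed image file, if installed."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

In a Django application this kind of step would typically run in a background task: decode the uploaded scan, binarize it, write the cleaned image to a temporary file, and pass that file's path to `run_ocr`. Rotation (deskew) and level adjustment would be separate steps and are not covered by this sketch.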
