I am working on a Python/Django web application and need to extract text from scanned documents (for search indexing).
What options are there for OCR engines? I know of Tesseract, but I am not entirely satisfied with its results. More extensive pre-processing (rotation, level adjustment, etc.) might solve the problem.
Requirements:
- Should not require manual tuning (other than initial tuning)
- Preferably open source; alternatively, it should be possible to buy a “liberal” license
- Either a Python module or a command-line program (or a C library that I can turn into a command-line program 🙂 )
Alternatively:
- A good library that does image pre-processing so that an existing engine like Tesseract will perform better.
Tesseract can optionally be compiled with Leptonica, a library with a fairly exhaustive set of image manipulation routines (I’m not sure whether Tesseract itself uses it for anything beyond supporting formats other than basic TIF). A thorough list of features can be found on the website. The project author, Dan Bloomberg, has written a few papers on image preprocessing for OCR, which might also be of interest to you; you can find them with a Google search for
site:http://www.leptonica.com/papers/
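
To make the "level adjustment, then OCR" idea concrete, here is a minimal sketch in plain Python. It is not Leptonica's API: `stretch_levels`, `tesseract_command` and `ocr_file` are hypothetical helper names, the percentile defaults are guesses you would tune once, and the sketch assumes the `tesseract` executable is on your PATH and that you already have the page as a flat list of 8-bit grayscale values. In practice you would let an imaging library do the pixel work.

```python
import shutil
import subprocess
from pathlib import Path


def stretch_levels(pixels, low_pct=2.0, high_pct=98.0):
    """Linear level adjustment for 8-bit grayscale values.

    The values found at the low/high percentiles are mapped to 0 and 255,
    and everything outside that range is clipped.
    """
    if not pixels:
        return []
    ordered = sorted(pixels)
    last = len(ordered) - 1
    low = ordered[int(last * low_pct / 100)]
    high = ordered[int(last * high_pct / 100)]
    if high <= low:
        # Flat image: nothing to stretch, return a copy unchanged.
        return list(pixels)
    span = high - low
    # Integer arithmetic keeps the result deterministic.
    return [min(255, max(0, (p - low) * 255 // span)) for p in pixels]


def tesseract_command(image_path, output_base, lang="eng"):
    """Build the command line: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_file(image_path, output_base, lang="eng"):
    """Run Tesseract on an already pre-processed image and return the text.

    Tesseract writes its result to <output_base>.txt.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

You would call `stretch_levels` on the pixel data, write the adjusted image back to disk with whatever imaging library you use, and then pass that file to `ocr_file`. Since the percentiles are fixed up front, this fits the "no manual tuning after the initial setup" requirement, although whether it actually improves recognition on your scans is something you would have to measure.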