I’m using tesseract OCR with python-tesseract.
In the tesseract FAQ, regarding digits, we have:
Use
TessBaseAPI::SetVariable("tessedit_char_whitelist", "0123456789");
BEFORE calling an Init function or put this in a text file called
tessdata/configs/digits:
tessedit_char_whitelist 0123456789
and then your command line becomes:
tesseract image.tif outputbase nobatch digits
Warning: Until the old and new config variables get merged, you must
have the nobatch parameter too.
In python-tesseract, the SetVariable method exists. I’ve tried this, but the result of the OCR is the same:
api = tesseract.TessBaseAPI()
api.SetVariable("tessedit_char_whitelist", "0123456789")
api.Init('.','eng',tesseract.OEM_DEFAULT)
api.SetPageSegMode(tesseract.PSM_AUTO)
Has anyone already got this working, or should I consider it a bug in python-tesseract?
OK, got it working.
According to this (unofficial?) documentation of tesseract-ocr, SetVariable() must be called after Init(), even though the official FAQ says the opposite.
Calling it after Init() works as intended.
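For reference, here is a minimal sketch of the order that worked for me, using the same calls as in the question. The helper name make_digits_api and its parameters are my own (hypothetical), and the tesseract module is passed in as an argument only so the function can be tried without the bindings installed; the datapath and language are the ones from the question and may differ on your setup.

```python
def make_digits_api(tesseract, datapath=".", lang="eng"):
    """Return a TessBaseAPI restricted to digits.

    `tesseract` is the imported python-tesseract module (assumption: it
    exposes TessBaseAPI, OEM_DEFAULT and PSM_AUTO as in the question).
    """
    api = tesseract.TessBaseAPI()
    # Init() first: setting the whitelist before it had no effect for me.
    api.Init(datapath, lang, tesseract.OEM_DEFAULT)
    # Only now restrict recognition to digits.
    api.SetVariable("tessedit_char_whitelist", "0123456789")
    api.SetPageSegMode(tesseract.PSM_AUTO)
    return api


# Typical use, assuming python-tesseract is installed:
#   import tesseract
#   api = make_digits_api(tesseract)
```

The only change from the code in the question is that SetVariable() has moved below Init().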