I need to invoke Tesseract OCR (it's an open source C++ library that does optical character recognition) from a Java application server. Right now it's easy enough to run the executable using Runtime.exec(). The basic logic would be:
- Save the image that is currently held in memory to a file (a .tif).
- Pass the image file name to the tesseract command-line program.
- Read the output text file from Java using FileReader.
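For illustration, the three steps above could be sketched roughly as follows. This is only a sketch under several assumptions: the in-memory image is already TIFF-encoded in a byte array, the `tesseract` executable is on the PATH and is called as `tesseract <image> <outputbase>` (writing `<outputbase>.txt`), and the output text is UTF-8. The class and method names (`TesseractCli`, `buildCommand`, `recognize`) are hypothetical, and it uses `ProcessBuilder` and `Files` rather than `Runtime.exec()` and `FileReader`, which would work the same way in principle.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class TesseractCli {

    // Builds the command line: tesseract <image> <outputbase>.
    // Assumption: tesseract appends ".txt" to the output base name.
    public static List<String> buildCommand(String executable, Path image, Path outputBase) {
        return Arrays.asList(executable, image.toString(), outputBase.toString());
    }

    // Runs the three steps: write the image to disk, run tesseract, read the text back.
    // Assumption: tiffBytes already holds a TIFF-encoded image.
    public static String recognize(byte[] tiffBytes) throws IOException, InterruptedException {
        Path dir = Files.createTempDirectory("ocr");
        Path image = dir.resolve("input.tif");
        Path outputBase = dir.resolve("output");
        Path textFile = dir.resolve("output.txt");
        try {
            // Step 1: save the in-memory image to a file.
            Files.write(image, tiffBytes);

            // Step 2: run the external program; inheritIO() sends its console
            // output to the server's own stdout/stderr so the child cannot
            // block on a full pipe.
            Process process = new ProcessBuilder(buildCommand("tesseract", image, outputBase))
                    .inheritIO()
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("tesseract exited with code " + exitCode);
            }

            // Step 3: read the recognized text (assumed to be UTF-8).
            return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
        } finally {
            // Remove the temporary files whether or not OCR succeeded.
            Files.deleteIfExists(image);
            Files.deleteIfExists(textFile);
            Files.deleteIfExists(dir);
        }
    }
}
```

Each call pays for one file write, one process launch, and one file read on top of the OCR itself; that overhead is what a JNI wrapper could remove.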
How much of a performance improvement am I likely to get by writing a JNI wrapper for Tesseract? Unfortunately, there is no open source JNI wrapper that works on Linux. I would have to write it myself, and I am wondering whether the benefit is worth the development cost.
It’s hard to say whether it would be worth it. If you assume that, when run in-process via JNI, the OCR code can access the image data directly without it first being written to a file, then that would certainly eliminate any disk I/O constraints there.
I’d recommend going with the simpler approach and only undertaking the JNI option if performance is not acceptable. At least then you’ll be able to do some benchmarking and estimate the performance gains you might be able to realize.
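As a rough starting point for that benchmarking, a small timing helper along these lines could measure the average cost of one OCR call with the simple approach. This is a sketch only: `OcrBenchmark` and `averageMillis` are hypothetical names, and wall-clock timing like this gives just a coarse estimate.

```java
import java.util.concurrent.Callable;

public class OcrBenchmark {

    // Returns the average wall-clock time in milliseconds of one call to task,
    // measured over timedRuns calls after warmupRuns untimed calls.
    public static double averageMillis(Callable<?> task, int warmupRuns, int timedRuns)
            throws Exception {
        if (timedRuns <= 0) {
            throw new IllegalArgumentException("timedRuns must be positive");
        }
        // Untimed warm-up so JIT compilation and cold caches do not skew the result.
        for (int i = 0; i < warmupRuns; i++) {
            task.call();
        }
        long start = System.nanoTime();
        for (int i = 0; i < timedRuns; i++) {
            task.call();
        }
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000_000.0 / timedRuns;
    }
}
```

Timing the whole save/exec/read round trip, and separately timing only the file write and read, gives an idea of how much of the total is overhead that an in-process JNI call could avoid.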