I need to extract plain text from uploaded documents in order to make them searchable. The documents could be MS Word files or PDFs (either scanned or containing text). The application in question runs on a LAMP stack, but installing other software could be an option. Is there any tool, service, library, or combination of those that you could recommend for this task?
You can use a combination of shell utilities such as pdftotext for PDFs, wvWare for DOCs, and docx2txt.pl for DOCX files, as the textractor rubygem does.
There are also native PHP classes for extracting text from PDF and DOCX files.
Another rubygem, docsplit, even does OCR for you through Tesseract.
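The per-format dispatch described above can be sketched roughly as follows. This is a minimal illustration in Python, not the textractor gem's actual code. The names `EXTRACTORS`, `extraction_command` and `extract_text` are made up for the example, and it assumes the three tools are installed and on the PATH. It also assumes wvWare prints HTML to stdout, so DOC output would still need its tags stripped. The same command lines could be run from PHP with `exec()` or `shell_exec()`.

```python
import subprocess
from pathlib import Path

PLACEHOLDER = "{path}"  # replaced with the uploaded file's path

# Hypothetical mapping: file extension -> command-line template.
# The trailing "-" asks pdftotext and docx2txt.pl to write to stdout.
EXTRACTORS = {
    ".pdf": ["pdftotext", PLACEHOLDER, "-"],
    ".doc": ["wvWare", PLACEHOLDER],  # assumption: emits HTML on stdout
    ".docx": ["docx2txt.pl", PLACEHOLDER, "-"],
}


def extraction_command(path):
    """Return the argv list for the tool that handles this file type."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"no extractor configured for {suffix!r}")
    return [str(path) if arg == PLACEHOLDER else arg for arg in EXTRACTORS[suffix]]


def extract_text(path):
    """Run the extractor and return whatever it printed to stdout."""
    result = subprocess.run(
        extraction_command(path), capture_output=True, text=True, check=True
    )
    return result.stdout
```

Passing the command as a list avoids shell quoting problems with user-supplied file names, which matters for uploaded files.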
It might be a good idea to consider Solr for indexing and searching. You can use the Solr Cell plugin to index and search Word documents, PDFs and more. I use it successfully in one of my projects. Solr Cell is built on several projects, including Apache POI, Tika and PDFBox.
The tricky part is setting up all the jars that Solr Cell depends on and the Solr schema, and figuring out the indexing request parameters, but all of it can be worked out from the wiki documentation. Here are my jars and schema to get you started; the relevant part of the schema is the line containing “attachment”.
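As a rough idea of what an indexing request might look like, here is a sketch in Python that posts a file to Solr Cell's extracting handler. Everything specific here is an assumption rather than the setup described above: the base URL, the `/update/extract` handler path (the usual default), mapping the extracted body to a field named `attachment` via `fmap.content`, and the helper names `build_extract_url` and `index_file`.

```python
import urllib.parse
import urllib.request


def build_extract_url(solr_base, doc_id, content_field="attachment"):
    """Build the Solr Cell request URL for one document."""
    params = {
        "literal.id": doc_id,           # value for the document's unique key
        "fmap.content": content_field,  # assumption: store extracted text here
        "commit": "true",               # make the document searchable right away
    }
    return solr_base.rstrip("/") + "/update/extract?" + urllib.parse.urlencode(params)


def index_file(solr_base, doc_id, path):
    """POST the raw file bytes to Solr and return the HTTP status code."""
    with open(path, "rb") as fh:
        body = fh.read()
    request = urllib.request.Request(
        build_extract_url(solr_base, doc_id),
        data=body,  # supplying data makes this a POST request
        headers={"Content-Type": "application/octet-stream"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

For example, `index_file("http://localhost:8983/solr", "doc-1", "report.pdf")` would send the file to a local Solr instance, assuming one is running at that address with the handler enabled.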
Solr Cell does not do OCR, though. You will have to run scanned documents through an OCR engine first to make them searchable.
For OCR you can use the open-source engine Tesseract, which is developed by Google, or you might want to have a look at the commercial engine Abbyy. Both come as command-line utilities, which you can run from your PHP scripts. To get results from Tesseract comparable to those from Abbyy, you will have to do some pre- and postprocessing. There are also cloud services, which might be an easier option, for instance Wisetrend and Abbyy Cloud. The latter is in beta at the moment, so it’s free of charge, and it has ready-to-go PHP code samples.
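Running Tesseract from a script could look something like the sketch below. Tesseract works on images rather than PDFs, so this example assumes the scanned PDF is first rasterized with poppler's `pdftoppm`, which is a tool not mentioned above. The function names, the 300 dpi resolution and the `eng` language are also illustrative choices. No pre- or postprocessing is done, so output quality will depend on the scans.

```python
import subprocess
import tempfile
from pathlib import Path


def rasterize_command(pdf_path, prefix, dpi=300):
    """argv for pdftoppm: writes one PNG per page as <prefix>-<page>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)]


def tesseract_command(image_path, output_base, lang="eng"):
    """argv for tesseract: it appends ".txt" to output_base itself."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_scanned_pdf(pdf_path, lang="eng"):
    """Rasterize each page, OCR it, and return the concatenated text."""
    pages = []
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.run(rasterize_command(pdf_path, Path(tmp) / "page"), check=True)
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
        for image in sorted(Path(tmp).glob("page-*.png")):
            base = image.with_suffix("")
            subprocess.run(tesseract_command(image, base, lang), check=True)
            pages.append(base.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(pages)
```

The resulting text can then be indexed like the output of any other extractor.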