I need to extract plain text from uploaded documents in order to make them

Asked: May 27, 2026 (2026-05-27T17:01:03+00:00)

I need to extract plain text from uploaded documents in order to make them searchable. The documents could be MS Word files or PDFs (either scanned or containing text). The application in question runs on a LAMP stack, but installing other software could be an option. Is there any tool, service, library, or combination of those that you could recommend to accomplish this task?

1 Answer

Editorial Team · Answer 1 · 2026-05-27T17:01:04+00:00

You can use a combination of shell utilities: pdftotext for PDFs, wvWare for DOC files, and docx2txt.pl for DOCX files. This is the approach the textractor rubygem takes.

# on Ubuntu (newer releases ship pdftotext in poppler-utils
# rather than xpdf-utils)
apt-get install wv xpdf-utils links
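
As a rough sketch of how these utilities could be combined, the functions below pick a converter from the file extension. The names `converter_for` and `extract_text` are hypothetical, and the sketch assumes that `pdftotext`, `wvText` (from the wv package) and `docx2txt.pl` are on the PATH and that each accepts an input file followed by an output file. Check each tool's man page before relying on this.

```sh
#!/bin/sh
# Sketch only: choose a text extractor from the file extension.
# converter_for and extract_text are hypothetical helper names.

# Print the name of the tool to use for file $1, or fail if unsupported.
converter_for() {
    # Lower-case the name so that "REPORT.PDF" matches as well.
    name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$name" in
        *.pdf)  echo pdftotext ;;
        *.docx) echo docx2txt.pl ;;
        *.doc)  echo wvText ;;
        *)      return 1 ;;
    esac
}

# Write the plain text of file $1 to file $2.
extract_text() {
    tool=$(converter_for "$1") || {
        echo "unsupported file type: $1" >&2
        return 1
    }
    # Assumption: all three tools take "input output" as positional arguments.
    "$tool" "$1" "$2"
}

# Usage (paths are examples):
#   extract_text /var/uploads/report.pdf /tmp/report.txt
```

From PHP, a script like this could be invoked with `exec()` or `shell_exec()`, passing the uploaded file's path through `escapeshellarg()`.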

There are also native php classes for extracting PDF and docx.

Another rubygem, docsplit, even does OCR for you through Tesseract.

It might be a good idea to consider Solr for indexing and searching. You can use the Solr Cell plugin to index and search Word documents, PDFs and more. I use it successfully in one of my projects. Solr Cell is based on several projects, including Apache POI, Tika and PDFBox.

The tricky part is setting up all the jars that Solr Cell depends on and the Solr schema, and figuring out the indexing request parameters, but all of this can be worked out from the wiki documentation. Here are my jars and schema to get you started; the relevant part of the schema is the line containing “attachment”.
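
To give an idea of what an indexing request can look like, here is a minimal sketch that posts one file to Solr Cell's `/update/extract` handler with curl. The function names, the base URL, the document id and the mapping of the extracted body into a field called `attachment` are all assumptions for illustration; your handler path and field names depend on your own Solr configuration and schema.

```sh
#!/bin/sh
# Sketch only: post one file to Solr Cell (the ExtractingRequestHandler).
# solr_extract_url and index_file are hypothetical names; the base URL,
# the document id and the target field "attachment" are assumptions.

# Build the request URL from a base URL ($1) and a document id ($2).
# The id is not URL-encoded here, so keep it to URL-safe characters.
solr_extract_url() {
    # literal.id sets the unique key, fmap.content maps the extracted body
    # into the "attachment" field, commit=true commits right away.
    printf '%s/update/extract?literal.id=%s&fmap.content=attachment&commit=true' "$1" "$2"
}

# Upload file $3 under id $2 to the Solr instance at base URL $1.
index_file() {
    curl --fail --silent --show-error \
        -F "file=@$3" \
        "$(solr_extract_url "$1" "$2")"
}

# Usage (assumed local instance):
#   index_file http://localhost:8983/solr doc-42 /var/uploads/report.pdf
```

Committing on every request is convenient while testing, but for bulk imports it is usually better to drop `commit=true` and commit once at the end.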

Solr Cell does not do OCR, though. Scanned documents have to go through an OCR engine first before they become searchable.

For OCR you can use the open-source engine Tesseract, which is developed by Google, or you might want to have a look at the commercial engine Abbyy. Both come as command-line utilities, which you can run from your PHP scripts. To get results from Tesseract that are comparable to Abbyy's, you will have to do some pre- and postprocessing. There are also cloud services, which might be an easier option, for instance Wisetrend and Abbyy Cloud. The latter is in beta at the moment, so it's free of charge, and it has ready-to-go PHP code samples.
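
The local OCR route could be sketched roughly as follows: try `pdftotext` first, and only if the PDF has no text layer render its pages to images and run them through Tesseract. The names `has_text` and `pdf_to_text` are hypothetical, and the 300 dpi resolution and the `eng` language are assumed settings rather than tuned values. The sketch also assumes `pdftoppm` from poppler-utils (for its `-png` option) and a Tesseract version that accepts `stdout` as the output base. It does none of the pre- or postprocessing mentioned above.

```sh
#!/bin/sh
# Sketch only: fall back to OCR when a PDF has no usable text layer.
# has_text and pdf_to_text are hypothetical names.

# Succeed if stdin contains at least one non-whitespace character.
has_text() {
    [ -n "$(tr -d '[:space:]')" ]
}

# Print the text of PDF $1, using Tesseract when pdftotext finds nothing.
pdf_to_text() {
    # "-" as the output file makes pdftotext write to standard output.
    text=$(pdftotext "$1" -)
    if printf '%s' "$text" | has_text; then
        printf '%s\n' "$text"
        return 0
    fi
    tmp=$(mktemp -d) || return 1
    # Render each page to a PNG at an assumed 300 dpi.
    pdftoppm -r 300 -png "$1" "$tmp/page"
    for img in "$tmp"/page-*.png; do
        [ -e "$img" ] || continue
        # "stdout" as the output base makes tesseract print the text.
        tesseract "$img" stdout -l eng
    done
    rm -rf "$tmp"
}

# Usage (path is an example):
#   pdf_to_text /var/uploads/scan.pdf > /tmp/scan.txt
```

Checking for a text layer first avoids running OCR on PDFs that already contain text, which is both slower and less accurate than reading the embedded text.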
