- Is there any way to perform OCR while uploading a document?
- Can we index the entire document?
- Can the search engine index the entire document, even though users are required to pay to view the full document?
- Can the document be displayed as a preview, with only the selected excerpt visible and the rest blurred, while the format of the document is still viewable?
I’ve been trying to find easy solutions to these questions using simple PHP functions, or something that wouldn’t seem like rocket science to accomplish. But everywhere I look I see people talking about Apache POI and Solr Cell and all these server commands that I have no idea about. For the last question, I could only figure out that we can use PHP's GD library to generate images with blurred content, but I wasn’t sure how to make that work if there were formatted text, images, tables, etc. in the document.
So if someone has easy solutions, or even complicated solutions but with EASY instructions, those will do. Something like “php document content extraction for noobs” that starts from the a-b-c’s of it.
Thank you in advance!
Zend_Search_Lucene includes code for reading docx files, and it runs in pure PHP with no external tools.
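If you'd rather not pull in a library just for this, here is a rough sketch of the same idea in plain PHP. It relies on the fact that a docx file is a zip archive whose main text lives in `word/document.xml`, and it assumes the zip extension (`ZipArchive`) is enabled. The function name `docxToText` is made up for this example, and this is not how Zend_Search_Lucene does it internally; it also ignores headers, footers and footnotes.

```php
<?php
// Hypothetical helper (not part of any library): pull the plain text
// out of a .docx file using only the bundled zip extension.
function docxToText(string $path): string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Cannot open $path as a zip archive");
    }

    // The body of a Word document is stored in this archive entry.
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();

    if ($xml === false) {
        throw new RuntimeException("No word/document.xml found in $path");
    }

    // Put a newline after every paragraph so that words from adjacent
    // paragraphs do not run together once the tags are removed.
    $xml = str_replace('</w:p>', "</w:p>\n", $xml);

    // Drop all markup, then turn XML entities such as &amp; back into text.
    $text = strip_tags($xml);
    return html_entity_decode($text, ENT_QUOTES | ENT_XML1, 'UTF-8');
}
```

You would call it as `$text = docxToText('/path/to/file.docx');` and pass `$text` on to whatever does your indexing.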
For PDF and doc files, you can use command-line utilities such as pdftotext or catdoc to extract the plain text content. You can find similar utilities for most file formats if you search around, and most Linux distributions package them.
Once you have the raw text, you can feed it to any full-text search engine.
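To give an idea of what calling those utilities from PHP could look like, here is a minimal sketch. It assumes `pdftotext` and `catdoc` are installed and on the PATH, that PHP is allowed to run `exec()`, and that the server is a Unix-like system. The function names `extractionCommand` and `extractText` are invented for this example.

```php
<?php
// Hypothetical helper: build the shell command for a given file,
// chosen by file extension.
function extractionCommand(string $path): string
{
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    switch ($ext) {
        case 'pdf':
            // The trailing "-" tells pdftotext to write to standard output.
            return 'pdftotext ' . escapeshellarg($path) . ' -';
        case 'doc':
            // catdoc prints the extracted text to standard output by default.
            return 'catdoc ' . escapeshellarg($path);
        default:
            throw new InvalidArgumentException("Unsupported file type: $ext");
    }
}

// Hypothetical helper: run the command and return the extracted text.
function extractText(string $path): string
{
    exec(extractionCommand($path), $lines, $status);
    if ($status !== 0) {
        throw new RuntimeException("Text extraction failed for $path");
    }
    return implode("\n", $lines);
}
```

The `escapeshellarg()` call matters here because the file names come from user uploads; without it, an odd file name could end up being interpreted by the shell.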