To preface this, I know there are discussions on this in various places.
Half of what I read is outdated, buggy or simply unrelated to my situation.
This is why I am bringing it to the community that I know will have the answers.
Question: I have a directory (ideally hosted online) of PDF documents totaling around 70,000 pages; individual documents range from 20 to several hundred pages.
I am looking for a method, script or idea for the easiest way to search these PDFs for products. The PDFs all have a text layer that was created by OCR in Acrobat.
Any ideas, whether they be elaborate or inventive, are more than welcome.
My recommendation would be Apache Solr, a search server built on Lucene that is dead simple to use through its RESTful interface. It also works with Tika, a related Apache project that extracts metadata and structured text content from multiple formats (incl. PDF).
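To give a rough idea of what that looks like in practice, here is a minimal sketch in Python using only the standard library. It assumes Solr is running at its default local address, that a core (hypothetically named "pdfs" here) has the extracting request handler enabled at /update/extract, and that the core is configured so the extracted text ends up in the default search field. The function names are my own, not part of Solr.

```python
import json
import urllib.parse
import urllib.request
from pathlib import Path

SOLR_BASE = "http://localhost:8983/solr"  # assumption: default local Solr address
CORE = "pdfs"                             # hypothetical core name


def extract_url(doc_id, commit=False, base=SOLR_BASE, core=CORE):
    """Build the URL that sends one file through Tika and indexes it under doc_id."""
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    return f"{base}/{core}/update/extract?{urllib.parse.urlencode(params)}"


def search_url(query, rows=10, base=SOLR_BASE, core=CORE):
    """Build a query URL against the core's standard /select handler."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base}/{core}/select?{urllib.parse.urlencode(params)}"


def index_pdf(path):
    """POST the raw PDF bytes to Solr; the file path doubles as the document id."""
    request = urllib.request.Request(
        extract_url(str(path)),
        data=Path(path).read_bytes(),
        headers={"Content-Type": "application/pdf"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def commit():
    """Make everything indexed so far visible to searches."""
    with urllib.request.urlopen(f"{SOLR_BASE}/{CORE}/update?commit=true") as response:
        return response.status


def search(query):
    """Return the list of matching documents for a query string."""
    with urllib.request.urlopen(search_url(query)) as response:
        return json.load(response)["response"]["docs"]


def index_directory(folder):
    """Index every PDF under folder, committing once at the end."""
    for pdf in sorted(Path(folder).rglob("*.pdf")):
        index_pdf(pdf)
    commit()
```

After calling index_directory on your PDF folder, something like search("widget") would return the matching documents. Committing once at the end rather than after every file should matter for a collection of this size.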