To preface this, I know there are discussions on this in various places. Half

Question


Asked: May 16, 2026

To preface this, I know there are discussions on this in various places.
Half of what I read is outdated, buggy or simply unrelated to my situation.

This is why I am bringing it to the community that I know will have the answers.

Question: I have a directory (ideally hosted online) of PDF documents totaling around 70,000 pages. Individual documents range from 20 pages to several hundred.

I am looking for a method, script or idea for the easiest way to search these PDFs for products. The PDFs all have a text layer that was created by OCR in Acrobat.

Any ideas, whether they be elaborate or inventive, are more than welcome.


1 Answer

Editorial Team · Answer 1 · 2026-05-16T04:14:45+00:00


My recommendation would be Apache Solr, a search server built on Lucene that is dead simple to use through its RESTful interface. It also integrates with Apache Tika (originally a Lucene subproject), which extracts metadata and structured text content from many formats, including PDF.
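As a rough sketch of what that looks like in practice, the Python below builds the two HTTP requests involved: one that posts a PDF to Solr's extracting request handler (the Tika integration) and one that searches for a product name. The host, the core name `pdfs`, and the helper names are assumptions for illustration. The `/update/extract` handler also has to be enabled in your Solr configuration, and which field `q` searches depends on your schema.

```python
import urllib.parse
import urllib.request

# Assumed Solr location and core name; adjust to your installation.
SOLR_BASE = "http://localhost:8983/solr/pdfs"


def extract_url(doc_id, base=SOLR_BASE):
    """URL for posting one document to the extracting request handler."""
    params = urllib.parse.urlencode({"literal.id": doc_id, "commit": "true"})
    return f"{base}/update/extract?{params}"


def search_url(term, rows=10, base=SOLR_BASE):
    """URL for a simple keyword query against the default search field."""
    params = urllib.parse.urlencode({"q": term, "rows": rows, "wt": "json"})
    return f"{base}/select?{params}"


def index_pdf(path, doc_id, base=SOLR_BASE):
    """POST the raw PDF bytes to Solr; needs a running Solr instance."""
    with open(path, "rb") as fh:
        request = urllib.request.Request(
            extract_url(doc_id, base),
            data=fh.read(),
            headers={"Content-Type": "application/pdf"},
            method="POST",
        )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Looping `index_pdf` over the directory would index every document, after which fetching `search_url("some product")` returns the matching documents as JSON. Because the PDFs already carry an OCR text layer, Tika should be able to read that text directly.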
