Have an interesting problem and am looking for the right solution. We have around

Question

Asked: June 9, 2026

I have an interesting problem and am looking for the right solution. We have around 100,000 PDF documents of varying sizes, averaging 150 pages each. They are currently stored on a RAID6 server and are backed up off-site as well. In total, there are 6.5 TB of PDFs that we need to index.

We are currently converting the PDFs into text files and storing them in a similar folder structure on the server. These text files will then need to be indexed and made searchable, with links back to the original folder. Each text file uses the same name as its PDF, with an additional naming convention appended. If my estimates are correct, that comes to close to 4 billion words to index.

What would be a suitable solution for indexing these files?

1 Answer

Editorial Team · Answer 1 · 2026-06-09T14:03:05+00:00

I would take a look at Apache Solr. We are currently evaluating it as a full-text search engine for documents. It’s widely used and well supported.
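To make that a bit more concrete, here is a rough sketch of how the converted text files could be turned into Solr documents and sent to Solr's JSON update handler in batches. Several things here are assumptions rather than details from the question: the `_text.txt` suffix stands in for the unspecified naming convention, the two root directories are placeholders, and the field names `pdf_path_s` and `content_txt` assume the schema has matching (dynamic) fields. Treat it as a starting point, not a tuned ingestion pipeline.

```python
import json
import urllib.request
from pathlib import Path

# Hypothetical suffix standing in for the "additional naming convention".
TEXT_SUFFIX = "_text.txt"


def pdf_path_for(text_file: Path, text_root: Path, pdf_root: Path) -> Path:
    """Map a converted text file back to its source PDF in the mirrored tree."""
    relative = text_file.relative_to(text_root)
    pdf_name = relative.name[: -len(TEXT_SUFFIX)] + ".pdf"
    return pdf_root / relative.parent / pdf_name


def build_docs(text_root: Path, pdf_root: Path):
    """Yield one document per text file, including a back link to the PDF."""
    for text_file in sorted(text_root.rglob("*" + TEXT_SUFFIX)):
        yield {
            # The relative path doubles as a stable unique key.
            "id": text_file.relative_to(text_root).as_posix(),
            # Assumed field names; adjust to whatever the schema defines.
            "pdf_path_s": pdf_path_for(text_file, text_root, pdf_root).as_posix(),
            "content_txt": text_file.read_text(encoding="utf-8", errors="replace"),
        }


def batches(items, size):
    """Group an iterable into lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def post_batch(solr_update_url: str, batch: list) -> None:
    """POST a JSON array of documents to a Solr update handler URL."""
    request = urllib.request.Request(
        solr_update_url,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        response.read()  # Raises on HTTP errors; body is ignored here.
```

Usage would look something like `for batch in batches(build_docs(text_root, pdf_root), 100): post_batch(url, batch)`, where `url` is the update endpoint of your own core (for example `http://localhost:8983/solr/pdfs/update`, with `pdfs` being a made-up core name), followed by a single commit at the end. Storing the PDF path alongside the text is what gives you the back link from a search hit to the original file.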
