I have an interesting problem and am looking for the right solution. We have around 100,000 PDF documents of varying sizes, averaging 150 pages each. They are currently stored on a RAID6 server and are also backed up off-site. In total, there are 6.5TB of PDFs that we need to index.
We are currently converting the PDFs into text files and storing them in a similar folder structure on the server. The text files will then need to be indexed and made searchable, with links back to the original folder. Each text file uses the same name as its PDF, with an additional naming convention appended. If my estimates are correct, that comes to close to 4 billion words that will need to be indexed.
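As a rough sanity check on those numbers, the arithmetic can be sketched as below. It only uses the figures stated above (100,000 documents, 150 pages on average, 6.5TB, about 4 billion words); the derived per-page and per-document values are implied by those figures rather than measured, and 6.5TB is assumed to be decimal terabytes.

```python
# Figures taken from the question; everything derived below is an estimate.
documents = 100_000
pages_per_document = 150                 # stated average
total_bytes = 6_500_000_000_000          # 6.5 TB, assuming decimal terabytes
estimated_words = 4_000_000_000          # the poster's own estimate

total_pages = documents * pages_per_document
words_per_page = estimated_words / total_pages
bytes_per_document = total_bytes // documents

print(f"total pages:          {total_pages:,}")
print(f"implied words/page:   {words_per_page:.0f}")
print(f"average PDF size:     {bytes_per_document / 1_000_000:.0f} MB")
```

So the 4 billion word estimate corresponds to a little under 270 words per page across about 15 million pages, with PDFs averaging about 65 MB each.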
What would be a suitable solution for indexing these files?
I would take a look at Apache Solr. We are currently looking into using it as a full-text search engine for documents. It’s widely used and well supported.
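To give an idea of what feeding the converted text files into Solr could look like, here is a minimal sketch that walks the text folder, works out the path of the original PDF, and posts batches of documents to Solr's JSON update endpoint. Several things here are assumptions and not details from the question: the `_text.txt` suffix stands in for the unspecified naming convention, the field names `pdf_path_s` and `content_txt` rely on dynamic fields like those in Solr's default configset, and the core name and URL in the usage note are placeholders.

```python
import json
import urllib.request
from pathlib import Path

# Hypothetical naming convention: "report.pdf" is converted to "report_text.txt".
TEXT_SUFFIX = "_text.txt"


def pdf_path_for(text_path: Path, text_root: Path, pdf_root: Path) -> Path:
    """Map a text file back to its source PDF in the parallel folder tree."""
    relative = text_path.relative_to(text_root)
    pdf_name = relative.name[: -len(TEXT_SUFFIX)] + ".pdf"
    return pdf_root / relative.parent / pdf_name


def build_docs(text_root: Path, pdf_root: Path):
    """Yield one Solr document per text file, carrying a link to the PDF."""
    for text_path in sorted(text_root.rglob("*" + TEXT_SUFFIX)):
        pdf_path = pdf_path_for(text_path, text_root, pdf_root)
        yield {
            "id": pdf_path.as_posix(),           # unique key
            "pdf_path_s": pdf_path.as_posix(),   # back link (assumed field name)
            "content_txt": text_path.read_text(encoding="utf-8", errors="replace"),
        }


def post_batch(docs: list, update_url: str) -> int:
    """POST a list of documents to Solr's update handler as a JSON array."""
    request = urllib.request.Request(
        update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def index_all(text_root: Path, pdf_root: Path, update_url: str, batch_size: int = 100) -> None:
    """Send documents in batches so no single request holds the whole corpus."""
    batch = []
    for doc in build_docs(text_root, pdf_root):
        batch.append(doc)
        if len(batch) >= batch_size:
            post_batch(batch, update_url)
            batch = []
    if batch:
        post_batch(batch, update_url)
```

With a local Solr instance and a core named `pdfs` (both placeholders), this would be called as `index_all(Path("/data/text"), Path("/data/pdf"), "http://localhost:8983/solr/pdfs/update?commit=true")`. At this volume you would probably want to commit once at the end instead of on every batch, and consider indexing per page rather than per document if search results need to point to a page.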