I want to extract the text from a PDF and then make its contents searchable using ColdFusion's built-in Verity search.
Are there any libraries out there that already do this well? Java and .NET libraries are in scope (Java preferred), since they can be called from CF.
Any insights or experiences would be greatly appreciated… thanks!
Edit: As far as I know, CF can index PDF files when the text is embedded in the PDF. The PDFs I have to deal with contain the text only as scanned images.
If you have the ability to run your own software (i.e. on a dedicated server or VPS), then you could investigate using Tesseract OCR with cfexecute to convert the PDFs to text.
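Since you said Java is preferred, here is a rough sketch of the same idea done from Java instead of cfexecute. It just shells out to the tesseract command line (tesseract <image> <outputbase> -l <lang>) and reads back the .txt file that Tesseract writes. Assumptions on my part: Java 11 or later, tesseract installed and on the PATH, and each PDF page already converted to an image such as TIFF or PNG, because Tesseract reads images rather than PDFs (that conversion step is not shown). The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: wraps the tesseract command-line tool.
class TesseractOcr {

    // Builds the argument list for: tesseract <image> <outputBase> -l <lang>
    static List<String> buildCommand(String tesseractExe, Path image,
                                     Path outputBase, String lang) {
        return List.of(tesseractExe, image.toString(),
                       outputBase.toString(), "-l", lang);
    }

    // Runs tesseract on one page image and returns the recognized text.
    static String ocr(String tesseractExe, Path image,
                      Path outputBase, String lang)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(
                buildCommand(tesseractExe, image, outputBase, lang))
                .redirectErrorStream(true)
                // Discard console output so the child process cannot block
                // on a full pipe; the text itself goes to a file.
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        // Tesseract appends ".txt" to the output base name.
        return Files.readString(Path.of(outputBase.toString() + ".txt"));
    }
}
```

Something like TesseractOcr.ocr("tesseract", Path.of("page1.tif"), Path.of("page1"), "eng") would then give you the text of one page. You could write the text for each PDF to a file and point cfindex at those files to get them into a Verity collection, though I have not tested that part.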