I have over 30,000 PDF files. Some are already OCR’d and some are not. Is there a way to find out which files are already OCR’d and which PDFs are image-only?
It would take forever if I ran every single file through an OCR processor.
I would write a small script to extract the text from the PDF files and see if it is “empty”. If there is text, the PDF has already been OCR’d. You could use either Ghostscript or Xpdf to extract the text.
EDIT:
This should get you started:
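A minimal sketch of that approach in Python, assuming the `pdftotext` command-line tool (shipped with Xpdf and with Poppler) is installed and on the PATH. The function names `extract_text`, `has_text` and `scan` are made up for this example and do not come from any library.

```python
import subprocess
from pathlib import Path


def extract_text(pdf_path):
    """Return the text layer of a PDF using the pdftotext CLI.

    Passing "-" as the output file makes pdftotext write to stdout.
    """
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def has_text(text):
    """True if the extracted text contains anything besides whitespace."""
    return bool(text.strip())


def scan(directory):
    """Yield (path, has_text_layer) for every .pdf file under directory."""
    for pdf in sorted(Path(directory).rglob("*.pdf")):
        try:
            yield pdf, has_text(extract_text(pdf))
        except subprocess.CalledProcessError:
            # pdftotext failed (for example a damaged or encrypted file):
            # report None so the file can be checked by hand.
            yield pdf, None
```

Looping over `scan("/path/to/pdfs")` and printing each pair gives a first rough split of the collection without running OCR on anything.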
Unfortunately, even when you have only images in your PDF, pdftotext will extract some text, so you will have to do some more work to check whether you need to OCR the PDF.
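One way to do that extra work is to stop testing for strictly empty output and instead require a minimum amount of real text per page. The sketch below is only a heuristic and rests on two assumptions that should be checked against a sample of your own files: that the stray output from image-only PDFs is mostly whitespace and page-break (form feed) characters, and that 20 letters or digits per page is a sensible cut-off. The names `meaningful_chars`, `needs_ocr` and `min_chars_per_page` are hypothetical.

```python
def meaningful_chars(text):
    """Count letters and digits, ignoring whitespace and punctuation."""
    return sum(1 for ch in text if ch.isalnum())


def needs_ocr(text, min_chars_per_page=20):
    """Guess whether a PDF still needs OCR from its pdftotext output.

    Assumption: pdftotext ends each page with a form feed ("\f"), so the
    number of form feeds approximates the page count.
    Assumption: min_chars_per_page=20 is an arbitrary threshold to tune.
    """
    pages = max(1, text.count("\f"))
    return meaningful_chars(text) / pages < min_chars_per_page
```

Here `text` is whatever `pdftotext file.pdf -` printed for one file. Files that land close to the threshold are worth opening by hand before deciding either way.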