I have over 30,000 pdf files. Some files are already OCR and some are

Question

Asked: May 26, 2026 (2026-05-26T03:12:27+00:00)

I have over 30,000 PDF files. Some have already been OCR'd and some have not. Is there a way to find out which files already contain OCR text and which are image-only?

It would take forever to run every single file through an OCR processor.

1 Answer

Editorial Team · Answer 1 · 2026-05-26T03:12:28+00:00

I would write a small script that extracts the text from each PDF file and checks whether the result is “empty”. If there is text, the PDF has already been OCR'd. You could use either Ghostscript or Xpdf to extract the text.

EDIT:
This should get you started:

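# Run pdftotext on every PDF in the current directory and show what it extracts.
foreach ($pdffile in Get-ChildItem -Filter *.pdf) {
    # The call operator (&) passes the path as a single argument, so file names
    # with spaces or quotes work. "-" makes pdftotext write the text to stdout.
    # Out-String joins the output lines, so .Length is a character count.
    $pdftext = (& "\path\to\xpdf\pdftotext.exe" $pdffile.FullName "-" | Out-String)
    Write-Host $pdffile.FullName
    Write-Host $pdftext.Length
    Write-Host $pdftext
    Write-Host "-------------------------------"
}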

Unfortunately, even when a PDF contains only images, pdftotext will still extract some text (often just whitespace or page-break characters), so you will have to do some more work to check whether the PDF needs OCR.
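
One possible way to do that extra check, as a minimal sketch: count only the letters and digits in the pdftotext output and treat a file as already OCR'd when the count reaches a threshold. The helper name Test-HasRealText and the threshold of 50 characters are assumptions made up for this illustration, not part of Xpdf, and the pdftotext path is the same placeholder as above.

```powershell
# Hypothetical helper: returns $true when the extracted text contains at least
# $MinChars letters or digits. Whitespace, form feeds and punctuation are ignored.
function Test-HasRealText {
    param(
        [string]$Text,
        [int]$MinChars = 50   # assumed threshold, tune it on your own documents
    )
    # \p{L} matches any Unicode letter, \p{N} any Unicode digit
    $count = [regex]::Matches($Text, '[\p{L}\p{N}]').Count
    return $count -ge $MinChars
}

# Placeholder path to pdftotext; replace it with the real location
$pdftotext = "\path\to\xpdf\pdftotext.exe"

# Skip the loop while the placeholder path has not been replaced
if (Test-Path $pdftotext) {
    foreach ($pdffile in Get-ChildItem -Filter *.pdf) {
        $text = (& $pdftotext $pdffile.FullName "-" | Out-String)
        if (Test-HasRealText $text) { $status = "has text" } else { $status = "needs OCR" }
        Write-Host ("{0}`t{1}" -f $status, $pdffile.FullName)
    }
}
```

A scanned PDF with a short text stamp or header can still land on either side of the threshold, so it is worth spot-checking a sample of files from both groups before trusting the result for all 30,000.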
