What is the best way to programmatically check whether a PDF file consists entirely of scanned pages?
I have iText and PDFBox at my disposal. I can check whether a PDF file contains text and use the result to decide whether the file has been OCRed, but this approach is not 100% accurate. I’d like to know whether there is another way to handle the problem.
The solution must be Java-based.
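Your best bet might be to check whether the file has text and also whether each page contains a large page-sized image, or lots of tiled images which together cover the page. If you also check the metadata (for example the Producer and Creator fields, which often name the scanner or OCR software), this should cover most cases.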
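As a rough illustration of the per-page check, here is a minimal sketch in plain Java. It assumes you have already pulled the page size, the image placements, and the extracted character count out of the PDF with PDFBox or iText; that extraction step is not shown. The class `ScannedPageHeuristic`, the `Box` type, and the thresholds are all made up for this example, not part of either library. Coverage is estimated on a coarse grid so that tiled or overlapping images are not counted twice.

```java
import java.util.List;

/** Hypothetical helper: classifies one page from data already extracted with PDFBox or iText. */
final class ScannedPageHeuristic {

    enum PageKind { NOT_SCANNED, SCANNED_IMAGE_ONLY, SCANNED_WITH_TEXT_LAYER }

    /** Placement of one image on the page, in page units (lower-left corner plus size). */
    static final class Box {
        final double x, y, width, height;

        Box(double x, double y, double width, double height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        boolean contains(double px, double py) {
            return px >= x && px < x + width && py >= y && py < y + height;
        }
    }

    // Assumed values; tune them against your own documents.
    static final double MIN_COVERAGE = 0.90; // fraction of the page that images must cover
    static final int GRID = 50;              // the page is sampled on a GRID x GRID lattice

    /** Fraction of grid cell centres that fall inside at least one image. */
    static double imageCoverage(double pageWidth, double pageHeight, List<Box> images) {
        int covered = 0;
        for (int i = 0; i < GRID; i++) {
            for (int j = 0; j < GRID; j++) {
                double cx = (i + 0.5) * pageWidth / GRID;
                double cy = (j + 0.5) * pageHeight / GRID;
                for (Box b : images) {
                    if (b.contains(cx, cy)) {
                        covered++;
                        break; // count each cell once, even if images overlap
                    }
                }
            }
        }
        return covered / (double) (GRID * GRID);
    }

    /** Combines image coverage with the number of characters extracted from the page. */
    static PageKind classify(double pageWidth, double pageHeight,
                             List<Box> images, int textChars) {
        if (imageCoverage(pageWidth, pageHeight, images) < MIN_COVERAGE) {
            return PageKind.NOT_SCANNED;
        }
        // A page-filling image plus text usually means an OCR text layer over the scan.
        return textChars == 0 ? PageKind.SCANNED_IMAGE_ONLY : PageKind.SCANNED_WITH_TEXT_LAYER;
    }
}
```

You would call `classify` once per page and treat the file as fully scanned only if no page comes back as `NOT_SCANNED`. Pages with a full-page background image (brochures, slides) will still be misclassified, which is where the metadata check can help.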