I asked a similar question before on Stack Overflow. I want to ask a related question, so I am restating the original question here.
I was using PDFBox to extract images and text from a PDF, which is available on SkyDrive and Scribd. I used the following code to extract the text:
PDFTextStripper p = new PDFTextStripper();
String thistext=p.getText(document);
This extracted the text properly. However, when I tried to extract images from the same PDF using the ExtractImages class, the images produced were all full pages of the PDF, not the actual images (of which there should be 1).
It appeared to me that the PDF could be a scanned document. The answer I received said that the fact that it is scanned is the issue. I tried once more with pdftotext and pdfimages. The text was extracted, but pdfimages output 5 image files, which are all pages of the PDF (same as PDFBox).
As far as I know, raster images are stored as XObjects in a PDF. When I opened the PDF with a text editor, I saw 5 occurrences of the following line:
<< /Type /XObject /Subtype /Image /Name /X /Width 2600 /Height 3799
That is probably why PDFBox and Xpdf output the 5 pages of the PDF as image files. But then how is the text being extracted from the PDF? Is there technical documentation that explains why (or how) text can be extracted from such a document, where the pages are “supposedly” embedded as XObjects? I would like to cite that documentation in my report.
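The text-editor check described above can also be scripted. The following is only a minimal sketch in plain Java (no PDFBox), assuming the image dictionaries are stored uncompressed and list /Width directly before /Height, as in the quoted line. The names ImageXObjectScan and findImageSizes are made up for illustration. A PDF that keeps its objects in compressed object streams would need a real parser instead.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImageXObjectScan {
    // Matches an image XObject dictionary written in the same key order as the
    // quoted line: /Subtype /Image ... /Width w /Height h.
    // Assumption: the dictionary is uncompressed and uses exactly this order.
    private static final Pattern IMAGE_DICT = Pattern.compile(
            "/Subtype\\s*/Image\\b[^>]*?/Width\\s+(\\d+)\\s*/Height\\s+(\\d+)");

    /** Returns one "WIDTHxHEIGHT" entry per image dictionary found in the raw text. */
    static List<String> findImageSizes(String rawPdf) {
        List<String> sizes = new ArrayList<>();
        Matcher m = IMAGE_DICT.matcher(rawPdf);
        while (m.find()) {
            sizes.add(m.group(1) + "x" + m.group(2));
        }
        return sizes;
    }

    public static void main(String[] args) throws IOException {
        // ISO-8859-1 maps every byte to one char, so binary stream data
        // in the file cannot cause decoding errors.
        String raw = args.length > 0
                ? new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.ISO_8859_1)
                : "<< /Type /XObject /Subtype /Image /Name /X /Width 2600 /Height 3799";
        List<String> sizes = findImageSizes(raw);
        System.out.println(sizes.size() + " image XObject(s): " + sizes);
    }
}
```

Run against the PDF path as the first argument, this should list one entry per page-sized image if the file is laid out as assumed. It only counts images and says nothing about where the text comes from.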
Having inspected your PDF file, I can confirm the first guess in the comments to your question:
Your sample document is scanned and essentially consists of one bitmap image per page. When you zoom into the document, you can quickly see that all content looks fairly pixelated.
All the images have a resolution of 2600×3799 and are black and white.
Furthermore, these images have been OCR’ed, and the resulting text has been added invisibly to the pages, which is what allows selecting, copying and pasting.
E.g. have a look at the top of page 885:
Its content stream starts like this:
Here /Im0, the page image, is inserted
Here addition of text is prepared; especially have a look at
3 Tr: This operation sets the text rendering mode to 3, which is “Neither fill nor stroke text (invisible)” (section 9.3.6, Text Rendering Mode, in ISO 32000-1:2008).
Here you see text added, starting with an ‘A ‘ and an ‘%gust ‘. This actually shows that the result of the OCR’ing does not seem to have been properly checked, as that should have been ‘August’. The low quality text information continues:
As you can see, many special characters and formulas have either not been recognized at all or been recognized incorrectly.
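To illustrate how the 3 Tr operator hides the OCR text, here is a rough sketch in plain Java that walks a decoded content stream and collects the strings shown while the text rendering mode is 3. It is not a real content stream parser: it only understands simple "(string) Tj" operators, ignores TJ arrays and q/Q graphics state saves, and the sample stream in main is invented for illustration, not copied from your file. The names InvisibleTextScan and invisibleStrings are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InvisibleTextScan {
    // Either "<mode> Tr" (set text rendering mode) or "(<string>) Tj" (show text).
    // Assumption: strings contain no unescaped nested parentheses.
    private static final Pattern OPS = Pattern.compile(
            "(\\d+)\\s+Tr\\b|\\(((?:\\\\.|[^\\\\()])*)\\)\\s*Tj\\b");

    /** Returns the strings shown with Tj while the rendering mode is 3 (invisible). */
    static List<String> invisibleStrings(String contentStream) {
        List<String> hidden = new ArrayList<>();
        int renderMode = 0; // the initial text rendering mode is 0 (fill)
        Matcher m = OPS.matcher(contentStream);
        while (m.find()) {
            if (m.group(1) != null) {
                renderMode = Integer.parseInt(m.group(1)); // a Tr operator was found
            } else if (renderMode == 3) {
                hidden.add(m.group(2)); // a Tj operator while text is invisible
            }
        }
        return hidden;
    }

    public static void main(String[] args) {
        // Invented sample: draw the page image, then add invisible text on top.
        String stream = "q /Im0 Do Q BT 3 Tr /F1 12 Tf (A ) Tj (%gust ) Tj ET";
        System.out.println(invisibleStrings(stream));
    }
}
```

A text extractor such as PDFTextStripper or pdftotext reads the text-showing operators regardless of the rendering mode, which is why the text comes out even though a viewer paints only the page image.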