The Archive Base

Editorial Team
Asked: June 18, 2026

I asked a similar question before on Stack Overflow. I want to ask a related follow-up, so I am restating the original question here.

I was using PDFBox to extract images and text from a PDF that is available on SkyDrive and Scribd. I used the following code to extract the text:

 // document is an already loaded org.apache.pdfbox.pdmodel.PDDocument
 PDFTextStripper p = new PDFTextStripper();
 String thistext = p.getText(document);

This extracted the text properly. However, when I tried to extract images from the same PDF using the ExtractImages class, the images produced were all full pages of the PDF, not the actual embedded image (there should be only 1).

It appeared to me that the PDF could be a scanned document. The answer to my earlier question said that the document being scanned is the issue. I tried once more with pdftotext and pdfimages. The text is extracted, but pdfimages outputs 5 image files, each of which is a full page of the PDF (the same result as with PDFBox).

As far as I know, raster images are stored as XObjects in a PDF. When I opened the PDF with a text editor, I saw 5 occurrences of the following line:

<< /Type /XObject /Subtype /Image /Name /X /Width 2600 /Height 3799

That is probably why PDFBox and Xpdf output the 5 pages of the PDF as image files. How, then, is the text being extracted from the PDF? Is there technical documentation that explains why (or how) text can be extracted from such a document, where the pages are “supposedly” embedded as XObjects? I would like to cite that documentation in my report.
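
The text-editor check described above can be reproduced programmatically. The following is only a minimal sketch using the Java standard library: the class name `ImageXObjectCounter` and its method are hypothetical, and it assumes the image dictionaries are written with the `/Subtype` key immediately followed by the `/Image` name, as in the line quoted above. It does not parse the PDF and does not see inline images.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImageXObjectCounter {
    // Matches "/Subtype /Image" with optional whitespace between key and name,
    // but not longer names that merely start with "Image".
    private static final Pattern IMAGE_SUBTYPE =
            Pattern.compile("/Subtype\\s*/Image(?![A-Za-z0-9])");

    /** Counts textual occurrences of "/Subtype /Image" in the raw file bytes. */
    static int countImageDictionaries(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary stream
        // data cannot cause decoding errors.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        Matcher matcher = IMAGE_SUBTYPE.matcher(raw);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in for real file content: the dictionary line quoted in the question.
        String sample = "<< /Type /XObject /Subtype /Image /Name /X /Width 2600 /Height 3799";
        byte[] bytes = sample.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(countImageDictionaries(bytes));
    }
}
```

To run it against a real file, the byte array could be read with `java.nio.file.Files.readAllBytes` instead of being built from the sample string.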

1 Answer

  1. Editorial Team
    Added an answer on June 18, 2026 at 9:02 pm

    Having inspected your PDF file, I can confirm the first guess made in the comments to your question.

    Your sample document is scanned and essentially consists of one bitmap image per page. When you zoom into the document, you can quickly see that all content looks fairly pixelated.

    All the images have pixel dimensions of 2600×3799 and are black and white.

    Furthermore, these images have been OCR’ed, and the resulting text has been added invisibly to the pages, which allows for selecting, copying, and pasting.

    For example, have a look at the top of page 885:

    [Image: top of page 885]

    Its content stream starts like this:

    1 0 0 1 -0.5998 -0.4801 cm
    1 1 1 rg
    1 i 
    /RelativeColorimetric ri
    /GS0 gs
    0 0 469.2 684.7 re
    f
    q
    467.9972 0 0 683.8015 0.6014 0.4492 cm
    /Im0 Do
    Q
    

    Here /Im0, the page image, is inserted.
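
As a side calculation, the `cm` operands in front of `/Im0 Do` give the size at which the image is painted, so the scan resolution can be estimated from them. This is a rough sketch under stated assumptions: default user space of 72 units per inch (no `/UserUnit` entry on the page), and no scaling other than the `cm` shown (the earlier `cm` is a pure translation). The class name `ScanResolution` is made up for illustration.

```java
class ScanResolution {
    /** Units per inch in default PDF user space (assumes no /UserUnit override). */
    static final double POINTS_PER_INCH = 72.0;

    /** Pixels per inch for an image dimension painted across the given extent in points. */
    static double dpi(int pixels, double extentInPoints) {
        return pixels / (extentInPoints / POINTS_PER_INCH);
    }

    public static void main(String[] args) {
        // /Width and /Height come from the image dictionary quoted in the question;
        // the extents are operands a and d of "467.9972 0 0 683.8015 0.6014 0.4492 cm".
        double horizontal = dpi(2600, 467.9972);
        double vertical = dpi(3799, 683.8015);
        System.out.println("horizontal dpi, rounded: " + Math.round(horizontal));
        System.out.println("vertical dpi, rounded: " + Math.round(vertical));
    }
}
```

Both directions come out at roughly 400 pixels per inch, which is consistent with a page scanned at 400 dpi.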

    1 0 0 1 0.5998 0.4801 cm
    0 0 0 rg
    BT
    /TT0 1 Tf
    3 Tr 9.8 0 0 10.4 35.8002 640.4199 Tm
    

    Here the addition of text is prepared. Note in particular 3 Tr: this operation sets the text rendering mode to 3, which means "Neither fill nor stroke text (invisible)" (section 9.3.6, Text Rendering Mode, in ISO 32000-1:2008).
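
To make the meaning of the `Tr` operand concrete, here is a small sketch that looks for `Tr` operators in a content stream fragment and prints the mode description from Table 106 of ISO 32000-1:2008. The class `TextRenderingModes` is hypothetical, and the whitespace tokenization is a deliberate simplification that is good enough for the fragment above but not for arbitrary content streams.

```java
import java.util.ArrayList;
import java.util.List;

class TextRenderingModes {
    // Mode descriptions from Table 106 of ISO 32000-1:2008, indexed by mode number.
    static final String[] NAMES = {
        "Fill text",
        "Stroke text",
        "Fill, then stroke text",
        "Neither fill nor stroke text (invisible)",
        "Fill text and add to path for clipping",
        "Stroke text and add to path for clipping",
        "Fill, then stroke text and add to path for clipping",
        "Add text to path for clipping"
    };

    /** Returns the operand of every Tr operator, in order of appearance. */
    static List<Integer> findModes(String contentStream) {
        List<Integer> modes = new ArrayList<>();
        // Naive whitespace tokenization: string literals that happen to
        // contain " Tr " would confuse it.
        String[] tokens = contentStream.trim().split("\\s+");
        for (int i = 1; i < tokens.length; i++) {
            if (tokens[i].equals("Tr")) {
                modes.add(Integer.parseInt(tokens[i - 1]));
            }
        }
        return modes;
    }

    public static void main(String[] args) {
        String fragment = "1 0 0 1 0.5998 0.4801 cm\n0 0 0 rg\nBT\n/TT0 1 Tf\n"
                + "3 Tr 9.8 0 0 10.4 35.8002 640.4199 Tm\n";
        for (int mode : findModes(fragment)) {
            System.out.println(mode + " = " + NAMES[mode]);
        }
    }
}
```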

    (A )Tj
    /TT1 1 Tf
    -0.01 Tc 8.8 0 0 9.5 43.4002 640.4199 Tm
    (%gust )Tj
    

    Here you see text being added, starting with ‘A ’ and ‘%gust ’. This shows that the OCR result does not seem to have been properly checked, since that should have read ‘August’. The low-quality text information continues:
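
The way a text extractor still picks up these strings can be sketched as follows: the rendering mode only affects painting, while the `Tj` operands remain in the content stream. The class `InvisibleTextCollector` below is an illustration, not how PDFBox or pdftotext are implemented. It assumes literal strings whose bytes map directly to characters, ignores font encodings, `TJ` arrays, hex strings and comments, and handles escapes only crudely.

```java
class InvisibleTextCollector {
    /** Concatenates the operands of Tj operators shown while the rendering mode is 3. */
    static String collect(String stream) {
        StringBuilder out = new StringBuilder();
        int mode = 0;              // default text rendering mode is 0 (fill)
        String lastToken = "";     // most recent non-string token
        String lastString = null;  // most recent literal string operand
        int i = 0;
        while (i < stream.length()) {
            char c = stream.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (c == '(') {
                // Read a literal string with nested parentheses and backslash escapes.
                StringBuilder s = new StringBuilder();
                int depth = 1;
                i++;
                while (i < stream.length() && depth > 0) {
                    char d = stream.charAt(i++);
                    if (d == '\\' && i < stream.length()) {
                        s.append(stream.charAt(i++)); // simplification: keep escaped char as-is
                    } else if (d == '(') {
                        depth++;
                        s.append(d);
                    } else if (d == ')') {
                        if (--depth > 0) {
                            s.append(d);
                        }
                    } else {
                        s.append(d);
                    }
                }
                lastString = s.toString();
            } else {
                // Read an operator, number or name up to whitespace or a string start.
                int start = i;
                while (i < stream.length()
                        && !Character.isWhitespace(stream.charAt(i))
                        && stream.charAt(i) != '(') {
                    i++;
                }
                String token = stream.substring(start, i);
                if (token.equals("Tr")) {
                    mode = Integer.parseInt(lastToken);
                } else if (token.equals("Tj") && mode == 3 && lastString != null) {
                    out.append(lastString);
                }
                lastToken = token;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String fragment = "BT\n/TT0 1 Tf\n3 Tr 9.8 0 0 10.4 35.8002 640.4199 Tm\n"
                + "(A )Tj\n/TT1 1 Tf\n-0.01 Tc 8.8 0 0 9.5 43.4002 640.4199 Tm\n(%gust )Tj\n";
        System.out.println("[" + collect(fragment) + "]"); // prints "[A %gust ]"
    }
}
```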

    A %gust , 1978 SHORT PAPERS 885
    where
    and also
    Similarly for B. Also,
    T, = AY-l T
    as a result of the adiabatic cooling of the vapour.
    Stage 2:
    Here a volume of vapour and a volume of liquid I are removed and replaced with an
    equal volume of air containing concentrations Y and s of A and B, respectively. Of course,
    r or s may either or both be negligibly small, with subsequent simplification.
    

    As you can see, many special characters and formulas have either not been recognized at all or been recognized incorrectly.

© 2021 The Archive Base. All Rights Reserved