The Archive Base Latest Questions

Editorial Team
Asked: June 12, 2026

I need to extract some information from a PDF stream.
Extracting the relevant text is quite simple, since it looks something like this:

BT /Fo0 7.20 Tf 67.81 569.38 Td 0.000 Tc (TOTAL AMOUNT) Tj ET

I can treat the y position as fixed, while the x position varies because of justification.
My problem is recognizing where a page begins and where it ends.


1 Answer

  1. Editorial Team
    Added an answer on June 12, 2026 at 5:09 am

    You shouldn’t assume that every PDF your ‘information extractor’ encounters behaves this nicely. Or can you, because you know for certain that they all do?

    Otherwise, it can very well happen that the PDF code you encounter looks like this:

    BT 
      /Fo0 7.20 Tf 
      67.81 569.38 Td 
      0.000 Tc 
      [(TO)12(T)13(AL A)11(M)14(OUNT)] TJ 
    ET
    

    That is, …

    • …using TJ instead of Tj, to allow individual glyph positioning,
    • …having more line breaks,
    • …and possibly many more modifications.
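
Both notations can be handled by one small tokenizer. The following is a minimal Python sketch, not a full content-stream parser: it assumes the stream is already uncompressed, that strings are literal `(...)` strings without nested unescaped parentheses, and that character codes map directly to Latin-1. The name `show_text_strings` is made up for illustration.

```python
import re

# One token: a literal string "(...)", an array bracket, or any other run of
# non-delimiter characters (numbers, /Names, operators such as Tj or Td).
_TOKEN = re.compile(rb"\((?:\\.|[^\\()])*\)|\[|\]|[^\s\[\]()]+", re.S)
_NUMBER = re.compile(rb"[+-]?(?:\d+\.?\d*|\.\d+)")


def _unescape(raw: bytes) -> str:
    # Only \( \) and \\ are unescaped here; other escapes are left untouched.
    # Assumption: one byte per character, decodable as Latin-1.
    return re.sub(rb"\\([()\\])", rb"\1", raw).decode("latin-1")


def show_text_strings(content: bytes) -> list[str]:
    """Return the text shown by each Tj / TJ operator, in stream order."""
    shown = []
    pending = []  # string operands seen since the last operator
    for match in _TOKEN.finditer(content):
        tok = match.group()
        if tok.startswith(b"("):
            pending.append(_unescape(tok[1:-1]))
        elif tok in (b"[", b"]") or tok.startswith((b"/", b"<")) or _NUMBER.fullmatch(tok):
            continue  # other operands: brackets, names, hex strings, kerning numbers
        else:
            # Anything else is treated as an operator and consumes the operands.
            if tok in (b"Tj", b"TJ") and pending:
                shown.append("".join(pending))
            pending = []
    return shown


simple = show_text_strings(
    b"BT /Fo0 7.20 Tf 67.81 569.38 Td 0.000 Tc (TOTAL AMOUNT) Tj ET"
)
kerned = show_text_strings(
    b"BT\n/Fo0 7.20 Tf\n67.81 569.38 Td\n0.000 Tc\n"
    b"[(TO)12(T)13(AL A)11(M)14(OUNT)] TJ\nET"
)
print(simple, kerned)  # both are ['TOTAL AMOUNT']
```

Because the tokenizer ignores whitespace and discards the numeric kerning adjustments inside the array, extra line breaks and glyph positioning no longer matter. Hex strings (`<...>`) are skipped entirely in this sketch.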

    In order to reliably get to the page’s text content, you have to parse the structure of the PDF, in short:

    1. find all objects of /Type /Page;
    2. go to each of these page objects and look up which object(s) its /Contents entry refers to;
      • the /Contents may point to a single stream, or
      • the /Contents may point to an array of streams;
    3. go to this content object and extract its stream(s).
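
The two /Contents cases can be normalized into one byte string. This is only a sketch over a toy object model: `Ref`, `objects` and `page_content` are hypothetical names, `objects` stands in for whatever your parser builds from the file's indirect objects, and the streams are assumed to be decoded already.

```python
from typing import NamedTuple


class Ref(NamedTuple):
    """Hypothetical stand-in for an indirect reference such as '5 0 R'."""
    num: int


def page_content(page: dict, objects: dict) -> bytes:
    """Return the page's complete content stream as one byte string."""
    contents = page.get("/Contents")
    if contents is None:
        return b""  # a page without /Contents is simply empty
    if isinstance(contents, Ref):
        contents = objects[contents.num]  # may resolve to a stream or an array
    items = contents if isinstance(contents, list) else [contents]
    parts = [objects[i.num] if isinstance(i, Ref) else i for i in items]
    # The streams of an array form one logical stream; joining with
    # whitespace keeps tokens at the stream boundaries separate.
    return b"\n".join(parts)


objects = {
    5: b"BT (A) Tj ET",
    6: b"BT (B) Tj ET",
    7: [Ref(5), Ref(6)],
}
single = page_content({"/Type": "/Page", "/Contents": Ref(5)}, objects)
multiple = page_content({"/Type": "/Page", "/Contents": [Ref(5), Ref(6)]}, objects)
```

With an array, the individual streams should only be parsed after joining, since a single text object may be split across them.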

    In practical terms, the first of the above steps can turn out a bit more complicated:

    • find and go to the trailer <<...>> section
    • in the trailer locate the info about the document’s /Root object
    • go to the root object
    • extract the info about the /Pages from the /Root object
    • go to the /Pages object (which is the root node of the page tree: an intermediate node with /Kids rather than a page itself);
    • find all descendants of this page tree node by inspecting its /Kids array;
    • go to each object listed by /Kids;
      • it could be of /Type /Pages (in which case it is another page tree node, not a tree leaf, and you have to follow the tree further down);
      • it could be of /Type /Page (in which case you have arrived at a page tree leaf, which means you have really arrived at a page).
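
The walk from the trailer down to the page leaves can be sketched as a depth-first traversal. This Python example again uses a toy object model, so `Ref`, `objects`, `walk_pages` and `pages_in_order` are hypothetical names, and locating and parsing the trailer and the objects themselves is assumed to have happened already.

```python
from typing import NamedTuple


class Ref(NamedTuple):
    """Hypothetical stand-in for an indirect reference such as '2 0 R'."""
    num: int


def walk_pages(node_ref, objects, seen=None):
    """Yield page dictionaries depth-first, following /Kids in order."""
    seen = set() if seen is None else seen
    if node_ref.num in seen:  # guard against malformed, cyclic trees
        return
    seen.add(node_ref.num)
    node = objects[node_ref.num]
    if node["/Type"] == "/Pages":  # intermediate node: descend further
        for kid in node["/Kids"]:
            yield from walk_pages(kid, objects, seen)
    elif node["/Type"] == "/Page":  # leaf: a real page
        yield node


def pages_in_order(trailer, objects):
    """trailer -> /Root (catalog) -> /Pages -> all leaves, in page order."""
    catalog = objects[trailer["/Root"].num]
    return list(walk_pages(catalog["/Pages"], objects))


objects = {
    1: {"/Type": "/Catalog", "/Pages": Ref(2)},
    2: {"/Type": "/Pages", "/Kids": [Ref(3), Ref(6)], "/Count": 3},
    3: {"/Type": "/Pages", "/Kids": [Ref(4), Ref(5)], "/Parent": Ref(2)},
    4: {"/Type": "/Page", "/Parent": Ref(3), "/Contents": Ref(10)},
    5: {"/Type": "/Page", "/Parent": Ref(3), "/Contents": Ref(11)},
    6: {"/Type": "/Page", "/Parent": Ref(2), "/Contents": Ref(12)},
}
pages = pages_in_order({"/Root": Ref(1)}, objects)
# A page's number is its position in this list plus one.
content_refs = [page["/Contents"].num for page in pages]
```

Each dictionary in `pages` marks one page, so the beginning and end of a page are the boundaries of that page's own content stream(s).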

    At this point I should note that the first page you find on this journey is page 1. The next is page 2, etc. Note that no page has any metadata saying “I’m page number N”: it all depends on the order in which you parse the page tree, starting from the root object.

    Now that you have really found the content streams, you are facing two more problems:

    1. The content streams you are looking for may not be in clear text at all, unlike the code you showed. Content streams are very frequently compressed with one of the allowed compression schemes, and you’ll have to expand them before you can parse them for text content.

      To see whether a stream is compressed, look for the respective *Decode keyword in its dictionary (very frequently appearing as /Filter /FlateDecode).

    2. Once you have successfully decompressed the page’s content stream, you may encounter totally unintuitive character codes describing your text. They may not at all be the kind of well-behaved ASCII you imagine and showed in your example code.

      You’ll have to look up fonts (even multi-byte fonts like CID), their encodings, CMaps and what-not.

      Unless, as I questioned in my initial sentence, you know that’s not happening in your specific use case…
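
Both problems can be illustrated with Python's standard library. This is a rough sketch under several assumptions: only a lone /FlateDecode filter without /DecodeParms is handled, and the `to_unicode` table is a hand-written, hypothetical stand-in for what a font's encoding or CMap would actually provide (parsing those is not shown). The function names are made up for illustration.

```python
import re
import zlib


def decode_stream(stream_dict: bytes, raw: bytes) -> bytes:
    """Return stream data, inflating it when /FlateDecode is the only filter."""
    match = re.search(rb"/Filter\s*(\[[^\]]*\]|/\w+)", stream_dict)
    if match is None:
        return raw  # no /Filter entry: the data is already clear text
    filters = re.findall(rb"/(\w+)", match.group(1))
    if filters == [b"FlateDecode"]:
        return zlib.decompress(raw)
    raise NotImplementedError(f"unsupported filter chain: {filters}")


def decode_hex_string(hex_string: bytes, to_unicode: dict, code_bytes: int = 2) -> str:
    """Map the character codes of a hex string '<...>' through a lookup table."""
    digits = re.sub(rb"\s+", b"", hex_string.strip()[1:-1])
    if len(digits) % 2:
        digits += b"0"  # PDF syntax: a missing final hex digit counts as 0
    data = bytes.fromhex(digits.decode("ascii"))
    codes = [
        int.from_bytes(data[i:i + code_bytes], "big")
        for i in range(0, len(data), code_bytes)
    ]
    # Codes missing from the table become U+FFFD instead of being guessed.
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)


# Problem 1: a compressed content stream.
plain = b"BT /Fo0 7.20 Tf 67.81 569.38 Td (TOTAL AMOUNT) Tj ET"
packed = zlib.compress(plain)
stream_dict = b"<< /Length %d /Filter /FlateDecode >>" % len(packed)
restored = decode_stream(stream_dict, packed)

# Problem 2: two-byte character codes that are not ASCII at all.
to_unicode = {0x0001: "T", 0x0002: "O", 0x0003: "A", 0x0004: "L"}  # hypothetical
text = decode_hex_string(b"<0001000200010003 0004>", to_unicode)
```

Raising on unknown filter chains and emitting U+FFFD for unmapped codes makes it obvious when a document falls outside these assumptions, rather than silently producing wrong text.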
