I need to extract some information from a PDF stream. It’s quite simple to…

Asked: June 12, 2026

I need to extract some information from a PDF stream.
It’s quite simple to extract the relevant text, since it looks something like this:

BT /Fo0 7.20 Tf 67.81 569.38 Td 0.000 Tc (TOTAL AMOUNT) Tj ET

I can treat the y position as fixed, while the x position varies because of justification.
My problem is recognizing where a page begins and where it ends.

1 Answer

Editorial Team · Answer 1 · 2026-06-12T05:09:15+00:00

You shouldn’t assume that every PDF your ‘information extractor’ encounters behaves so nicely. Or can you, because you know they do?

Otherwise, it can very well happen that the PDF code which you encounter looks like:

BT 
  /Fo0 7.20 Tf 
  67.81 569.38 Td 
  0.000 Tc 
  [(TO)12(T)13(AL A)11(M)14(OUNT)] TJ
ET

That is, …

…using TJ instead of Tj, to allow individual glyph positioning,
…having more line breaks,
…and maybe many more modifications.
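
To illustrate why matching only `(…) Tj` is fragile, here is a minimal Python sketch that collects the strings shown by both Tj and TJ. It assumes an uncompressed content stream, literal strings without unescaped nested parentheses, no hex strings, and a TJ operand written as a bracketed array, as the PDF specification requires. The names `shown_text` and `unescape` are made up for this example.

```python
import re

# One PDF literal string "( ... )"; backslash escapes are allowed inside.
STRING = rb"\((?:\\.|[^\\()])*\)"

# Either "(text) Tj" or "[ (te) 12 (xt) ] TJ".
SHOW = re.compile(
    rb"(" + STRING + rb")\s*Tj"
    rb"|\[((?:" + STRING + rb"|[^\]()])*)\]\s*TJ"
)


def unescape(raw: bytes) -> str:
    """Drop the enclosing parentheses and resolve \\( \\) \\\\ escapes."""
    body = re.sub(rb"\\([()\\])", rb"\1", raw[1:-1])
    # Assumption: a simple single-byte encoding, so latin-1 is good enough.
    return body.decode("latin-1")


def shown_text(content: bytes) -> list[str]:
    """Return one string per Tj/TJ operator found in the content stream."""
    result = []
    for match in SHOW.finditer(content):
        if match.group(1) is not None:
            # Tj: a single string operand.
            result.append(unescape(match.group(1)))
        else:
            # TJ: join the string pieces, ignoring the kerning numbers.
            pieces = re.findall(STRING, match.group(2))
            result.append("".join(unescape(p) for p in pieces))
    return result
```

Both the one-line Tj form and the multi-line TJ form then yield the same text, but this still says nothing about fonts, encodings, or the position of the text on the page.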

In order to reliably get to the page’s text content, you have to parse the structure of the PDF, in short:

find all objects of /Type /Page;
go to each of these page objects and look up what its /Contents entry refers to;
- the /Contents may point to single stream, or
- the /Contents may point to an array of streams;
go to this content object and extract its stream(s).

In practical terms, the first of the above steps can turn out a bit more complicated:

find and go to the trailer <<...>> section
in the trailer locate the info about the document’s /Root object
go to the root object
extract the info about the /Pages from the /Root object
go to the /Pages object (the root of the page tree: an intermediate node with /Kids, not a page itself);
find all descendants of this page tree node by inspecting its /Kids entry
go to each respective object listed by /Kids;
- it could be of /Type /Pages (in which case it is another page tree node, not a tree leaf, and you have to follow the tree further down);
- it could be of /Type /Page (in which case you have arrived at a page tree leaf, which means you have really arrived at a page).

At this point I should note that the first page you find on this journey is page 1, the next is page 2, and so on. No page carries any metadata saying “I’m page number N”; it all depends on the order in which you traverse the page tree, starting from the root object.
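
The traversal can be sketched in Python as follows. This is only an illustration under strong assumptions: the PDF objects are taken to be already parsed into a plain dictionary keyed by object number, and indirect references are represented as bare integers. The `objects` and `trailer` structures and both function names are hypothetical, not the API of any PDF library.

```python
def iter_pages(objects, node_id):
    """Yield the object numbers of all pages below node_id, in document order."""
    node = objects[node_id]
    if node["Type"] == "Page":
        # A leaf of the page tree: a real page.
        yield node_id
    elif node["Type"] == "Pages":
        # An intermediate node: descend into each kid, keeping the /Kids order.
        for kid in node["Kids"]:
            yield from iter_pages(objects, kid)


def page_contents(objects, trailer):
    """Return (page number, list of content stream object numbers) pairs."""
    root = objects[trailer["Root"]]
    result = []
    pages = iter_pages(objects, root["Pages"])
    # Page numbers come only from the traversal order, starting at 1.
    for number, page_id in enumerate(pages, start=1):
        contents = objects[page_id].get("Contents", [])
        if not isinstance(contents, list):
            contents = [contents]  # a single stream instead of an array
        result.append((number, contents))
    return result


# Hypothetical, hand-built object table for a three-page document.
objects = {
    1: {"Type": "Catalog", "Pages": 2},
    2: {"Type": "Pages", "Kids": [3, 6]},
    3: {"Type": "Pages", "Kids": [4, 5]},
    4: {"Type": "Page", "Contents": 10},
    5: {"Type": "Page", "Contents": [11, 12]},
    6: {"Type": "Page"},
}
trailer = {"Root": 1}
```

With this table, the page stored as object 6 comes out as page 3 simply because it is visited last, not because anything inside it says so.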

Now that you have really found the content streams, you face two more problems:

The content streams you are looking for may not be in clear text at all (unlike the code you showed). Content streams are very frequently compressed with one of the allowed compression schemes, and you’ll have to expand them before you can parse them for text content.

To see whether a stream is compressed, look for the respective *Decode filter name in its stream dictionary (very frequently it appears as /Filter /FlateDecode).
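
As a rough Python sketch of that check, the function below looks for *Decode names in the raw bytes of a stream dictionary and inflates the data when the only filter is FlateDecode. It assumes you have already isolated the dictionary and the stream data (for example by using /Length), and it ignores /DecodeParms such as predictors. The name `decode_stream` is made up for this example.

```python
import re
import zlib


def decode_stream(dictionary: bytes, raw: bytes) -> bytes:
    """Return the stream data with a single FlateDecode filter undone."""
    # Collect filter names such as FlateDecode, LZWDecode, ASCII85Decode.
    filters = re.findall(rb"/(\w+Decode)", dictionary)
    if not filters:
        return raw  # stored uncompressed
    if filters == [b"FlateDecode"]:
        return zlib.decompress(raw)  # zlib/deflate data
    # Other filters and filter chains are outside the scope of this sketch.
    raise NotImplementedError(f"unsupported filters: {filters}")
```

Only after this step does it make sense to search the result for text operators.
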
Once you have successfully uncompressed the page’s content stream, you may encounter totally unintuitive character codes describing your text. They may not be the well-behaved ASCII you imagine and showed in your example code at all.

You’ll have to look up fonts (even multi-byte fonts like CID), their encodings, CMaps and what-not.
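
To give an idea of what such a lookup involves, here is a small Python sketch that reads the bfchar entries of a ToUnicode CMap and uses them to decode a hex string operand. It assumes fixed two-byte character codes and handles neither bfrange sections nor the font’s /Encoding; the sample CMap text and both function names are invented for illustration.

```python
import re


def parse_bfchar(cmap: str) -> dict[int, str]:
    """Map character codes to Unicode text from beginbfchar/endbfchar blocks."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap, re.DOTALL):
        pairs = re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section)
        for code, target in pairs:
            # ToUnicode targets are UTF-16BE code units written in hex.
            mapping[int(code, 16)] = bytes.fromhex(target).decode("utf-16-be")
    return mapping


def decode_hex_string(hex_string: str, mapping: dict[int, str],
                      code_bytes: int = 2) -> str:
    """Decode the inside of a <...> string operand using the mapping."""
    raw = bytes.fromhex(hex_string)
    codes = [int.from_bytes(raw[i:i + code_bytes], "big")
             for i in range(0, len(raw), code_bytes)]
    # Unknown codes become U+FFFD instead of being guessed.
    return "".join(mapping.get(code, "\ufffd") for code in codes)


# Invented sample: codes 0003 and 0011 stand for "T" and "O".
sample_cmap = "2 beginbfchar\n<0003> <0054>\n<0011> <004F>\nendbfchar"
```

In this sample, the operand `<00030011>` decodes to "TO", even though neither byte pair looks anything like ASCII.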

Unless, as I questioned in my initial sentence, you know that’s not happening in your specific use case…
