I need to extract some information from a PDF stream.
It’s quite simple to extract the relevant text, since it is something like:
BT /Fo0 7.20 Tf 67.81 569.38 Td 0.000 Tc (TOTAL AMOUNT) Tj ET
I can treat the y position as fixed, while the x position varies because of justification.
My problem is recognizing where a page begins and where it ends.
You shouldn’t assume that every PDF your ‘information extractor’ encounters will behave so nicely. Or can you, because you know they do?
Otherwise, it can very well happen that the PDF code you encounter looks like:
That is, …
- TJ instead of Tj, to allow individual glyph positioning.

In order to reliably get to the page’s text content, you have to parse the structure of the PDF. In short:
1. find all objects of /Type /Page;
2. find out what their /Contents is;
3. /Contents may point to a single stream, or
4. /Contents may point to an array of streams.

In practical terms, the first of the above steps can turn out to be a bit more complicated:
1. locate the trailer <<...>> section;
2. find the /Root object;
3. get the reference to /Pages from the /Root object;
4. find the /Pages object (a page tree node with kids, not a page itself);
5. find its /Kids array;
6. follow each of the objects referenced from /Kids;
7. they may be of /Type /Pages (in which case it is another page tree node, not a tree leaf, and you have to follow the tree further down);
8. or they may be of /Type /Page (in which case you have arrived at a page tree leaf, which means you have really arrived at a page).

At this point I should note that the first page you find on this journey is page 1. The next is page 2, etc. Note that no page has any metadata saying “I’m page number N”; it all depends on the order in which you traverse the page tree, starting from the root object.
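The walk described above can be sketched roughly as follows. This is only an illustration under strong assumptions: the PDF objects are taken to be already parsed into plain Python dicts keyed by object number, and the names `objects`, `ref`, `resolve`, `walk_pages` and `content_refs` are made up for this example. Cross-reference parsing, object streams and inherited attributes are all ignored.

```python
# Hypothetical, already-parsed object table: object number -> dictionary.
# Indirect references are modelled as ("ref", n) tuples.
def ref(n):
    return ("ref", n)

objects = {
    1: {"Type": "Catalog", "Pages": ref(2)},
    2: {"Type": "Pages", "Kids": [ref(3), ref(6)], "Count": 3},
    3: {"Type": "Pages", "Parent": ref(2), "Kids": [ref(4), ref(5)], "Count": 2},
    4: {"Type": "Page", "Parent": ref(3), "Contents": ref(10)},
    5: {"Type": "Page", "Parent": ref(3), "Contents": [ref(11), ref(12)]},
    6: {"Type": "Page", "Parent": ref(2), "Contents": ref(13)},
}
trailer = {"Root": ref(1)}

def resolve(value):
    """Follow a ("ref", n) tuple to the object it points to."""
    if isinstance(value, tuple) and value[0] == "ref":
        return objects[value[1]]
    return value

def walk_pages(node):
    """Yield page dictionaries in document order (depth-first over /Kids)."""
    node = resolve(node)
    if node["Type"] == "Pages":        # page tree node: descend further
        for kid in node["Kids"]:
            yield from walk_pages(kid)
    elif node["Type"] == "Page":       # leaf: an actual page
        yield node

def content_refs(page):
    """Return /Contents as a list, whether it was one stream or an array."""
    contents = page.get("Contents", [])
    return contents if isinstance(contents, list) else [contents]

root = resolve(trailer["Root"])
pages = list(walk_pages(root["Pages"]))

# The page number is nothing more than the position in the traversal order.
for number, page in enumerate(pages, start=1):
    print(number, [r[1] for r in content_refs(page)])
```

The page boundaries the question asks about fall out of this traversal: everything in the content stream(s) of one leaf belongs to one page.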
Now that you have really found the content streams, you face two more problems:

1. The content streams you are looking for may not be in clear text at all (unlike the one your code showed). Content streams are very frequently compressed with one of the allowed compression schemes, and you’ll have to expand them before you can parse them for text content. To see if a stream is compressed, watch out for the respective *Decode keyword (very frequently appearing as /Filter /FlateDecode).
2. Once you have successfully uncompressed the page’s content stream, you may encounter totally unintuitive character codes describing your text. They may not be the well-behaved ASCII you imagine and showed in your example code. You’ll have to look up fonts (even multi-byte fonts like CID), their encodings, CMaps and what-not.
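For the compression part, a minimal sketch using Python's standard `zlib` module might look like this. It assumes you already have the stream dictionary as a Python dict and the raw bytes between `stream` and `endstream`; the function name `decode_stream` is made up for this example. It handles only /FlateDecode, and ignores /DecodeParms and every other filter.

```python
import zlib

def decode_stream(stream_dict, raw_bytes):
    """Return decoded stream bytes; only the simple cases are handled."""
    filters = stream_dict.get("Filter")
    if filters is None:
        return raw_bytes                  # stored uncompressed
    if isinstance(filters, str):
        filters = [filters]               # a single filter name
    data = raw_bytes
    for name in filters:                  # filters are applied in the order listed
        if name == "FlateDecode":
            data = zlib.decompress(data)
        else:
            raise NotImplementedError(f"filter not handled in this sketch: {name}")
    return data

# Round trip using the content stream line from the question.
plain = b"BT /Fo0 7.20 Tf 67.81 569.38 Td 0.000 Tc (TOTAL AMOUNT) Tj ET"
packed = zlib.compress(plain)
print(decode_stream({"Filter": "FlateDecode"}, packed) == plain)
```

The second problem (fonts, encodings, CMaps) has no comparably short sketch; the decoded bytes inside the parentheses are character codes, not necessarily readable text.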
Unless, as I asked at the start, you know that’s not happening in your specific use case…
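If you do know that your content streams are as simple as the one in the question, a regex-based sketch that copes with both Tj and TJ might look like the following. It assumes the stream is already decoded, that strings are literal `(...)` strings in a single-byte encoding without nested or escaped parentheses, and that each text object sets its position with a single Td. The function name `extract_text`, the patterns and the second sample line are illustrative assumptions, not a general PDF parser.

```python
import re

# One text object: everything between BT and ET.
TEXT_OBJECT = re.compile(rb"\bBT\b(.*?)\bET\b", re.S)
# Position operands of "x y Td".
TD = re.compile(rb"(-?\d*\.?\d+)\s+(-?\d*\.?\d+)\s+Td\b")
# "(string) Tj"  or  "[(str) -20 (ing)] TJ"
SHOW = re.compile(rb"\(([^()]*)\)\s*Tj\b|\[([^\]]*)\]\s*TJ\b")
# A literal string inside a TJ array.
LITERAL = re.compile(rb"\(([^()]*)\)")

def extract_text(content):
    """Yield (x, y, text) for each simple text object in a decoded stream."""
    for block in TEXT_OBJECT.finditer(content):
        body = block.group(1)
        pos = TD.search(body)
        x, y = (float(pos.group(1)), float(pos.group(2))) if pos else (None, None)
        parts = []
        for show in SHOW.finditer(body):
            if show.group(1) is not None:       # (string) Tj
                parts.append(show.group(1))
            else:                               # array form: keep strings, drop kerning
                parts.extend(LITERAL.findall(show.group(2)))
        yield x, y, b"".join(parts).decode("latin-1")

sample = (
    b"BT /Fo0 7.20 Tf 67.81 569.38 Td 0.000 Tc (TOTAL AMOUNT) Tj ET\n"
    b"BT /Fo0 7.20 Tf 70.02 555.00 Td [(TO) -20 (TAL)] TJ ET\n"
)
for item in extract_text(sample):
    print(item)
```

The x and y values here are just the raw Td operands; the current transformation matrix and text matrix are not taken into account.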