The Archive Base

Editorial Team
Asked: June 8, 2026

Hi, I’m trying to parse text from some PDFs and would like to use PoDoFo. I have searched for examples of how to parse a PDF with PoDoFo, but all I can find are examples of how to create and write a PDF file, which is not what I need.

If anyone has a tutorial or example of parsing a PDF file with PoDoFo, or a suggestion for a different library I could use, please let me know. I know there is pdftotext on Linux; however, not only can I not use it, I would much rather do everything internally and not rely on outside programs being installed.

1 Answer

  1. Editorial Team
    Added an answer on June 8, 2026 at 3:54 pm

    PoDoFo does not provide a means to easily extract text from a document, but it is not hard to do.

    Load a document into a PdfMemDocument:

    PoDoFo::PdfMemDocument pdf("mydoc.pdf");
    

    Iterate over each page:

    for (int pn = 0; pn < pdf.GetPageCount(); ++pn) {
        PoDoFo::PdfPage* page = pdf.GetPage(pn);
    

    Iterate over all the PDF commands on that page:

        PoDoFo::PdfContentsTokenizer tok(page);
        const char* token = nullptr;
        PoDoFo::PdfVariant var;
        PoDoFo::EPdfContentsType type;
        while (tok.ReadNext(type, token, var)) {
            switch (type) {
                case PoDoFo::ePdfContentsType_Keyword:
                    // process token: it contains the current command
                    //   pop from var stack as necessary
                    break;
                case PoDoFo::ePdfContentsType_Variant:
                    // process var: push it onto a stack
                    break;
                default:
                    // should not happen!
                    break;
            }
        }
    }
    

    The “process token” and “process var” comments are where it gets a little more complex. You are given raw PDF commands to process. Luckily, if you’re not actually rendering the page and all you want is the text, you can ignore most of them. The commands you need to process are:

    BT, ET, Td, TD, Ts, T*, Tm, Tf, ", ', Tj and TJ

    The BT and ET commands mark the beginning and end of a text object, so you want to ignore anything that’s not between a BT/ET pair.
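
As a rough sketch of that filtering step, the function below keeps only the operators that appear between BT and ET. It uses plain strings as a hypothetical stand-in for the keywords the tokenizer hands back, so `operatorsInsideText` is an illustrative name and not part of PoDoFo.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the operators that occur inside a BT ... ET pair.
// The input is a hypothetical flat list of operator names from one page.
std::vector<std::string> operatorsInsideText(const std::vector<std::string>& ops) {
    std::vector<std::string> kept;
    bool inText = false;  // true while we are between BT and ET
    for (const std::string& op : ops) {
        if (op == "BT") { inText = true; continue; }
        if (op == "ET") { inText = false; continue; }
        if (inText) kept.push_back(op);
    }
    return kept;
}
```

In the real loop the same `inText` flag would simply guard the keyword case of the switch.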

    The PDF language is RPN based. A command stream consists of values which are pushed onto a stack and commands which pop values off the stack and process them.
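
To make the push/pop model concrete, here is a minimal sketch that runs a token list through an operand stack. `Token`, `Command` and `runContentStream` are hypothetical names for illustration; the operand counts in the table are the ones the PDF specification gives for these text operators, and treating unlisted operators as consuming everything pending is a simplifying assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical token: either an operand (a value) or an operator (a command).
struct Token { bool isOperator; std::string text; };

// One executed command together with the operands it popped.
struct Command { std::string name; std::vector<std::string> operands; };

std::vector<Command> runContentStream(const std::vector<Token>& tokens) {
    // Number of operands each text operator consumes.
    static const std::map<std::string, std::size_t> arity = {
        {"BT", 0}, {"ET", 0}, {"T*", 0}, {"Ts", 1}, {"Tf", 2}, {"Td", 2},
        {"TD", 2}, {"Tm", 6}, {"Tj", 1}, {"TJ", 1}, {"'", 1}, {"\"", 3}};
    std::vector<std::string> stack;
    std::vector<Command> executed;
    for (const Token& t : tokens) {
        if (!t.isOperator) { stack.push_back(t.text); continue; }  // push operand
        const auto it = arity.find(t.text);
        // Assumption: an operator missing from the table takes everything pending.
        const std::size_t wanted = (it == arity.end()) ? stack.size() : it->second;
        const std::size_t n = std::min(wanted, stack.size());
        Command cmd;
        cmd.name = t.text;
        cmd.operands.assign(stack.end() - static_cast<std::ptrdiff_t>(n), stack.end());
        stack.resize(stack.size() - n);  // pop the consumed operands
        executed.push_back(cmd);
    }
    return executed;
}
```

For the stream `BT /F1 12 Tf 72 700 Td (Hello) Tj ET`, the Tf command ends up with the operands `/F1` and `12`, and Tj with `(Hello)`.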

    The ", ', Tj and TJ commands are the only ones which actually show text. ", ' and Tj each take a single string (for ", two spacing values come before it). Use var.IsString() and var.GetString() to process it.

    TJ takes an array whose elements are strings and numeric spacing adjustments. You can extract each string with:

    if (var.IsArray()) {
        PoDoFo::PdfArray& a = var.GetArray();
        for (size_t i = 0; i < a.GetSize(); ++i) {
            if (a[i].IsString()) {
                // do something with a[i].GetString()
            }
        }
    }
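
One way to turn such an array into readable text is sketched below. It models each array element with a hypothetical `TjElement` struct instead of PoDoFo types. The numbers in a TJ array are position adjustments in thousandths of a text-space unit, and a negative value widens the gap, so the sketch treats a sufficiently negative number as a word space. The threshold of -200 is a guess that would need tuning per document.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of one TJ array element: a string or a numeric adjustment.
struct TjElement {
    bool isString;
    std::string text;   // used when isString is true
    double adjustment;  // used when isString is false
};

// Concatenates the strings of a TJ array, inserting a single space wherever
// an adjustment is more negative than gapThreshold (a heuristic, not a rule).
std::string joinTjArray(const std::vector<TjElement>& elems, double gapThreshold = -200.0) {
    std::string text;
    for (const TjElement& e : elems) {
        if (e.isString) {
            text += e.text;
        } else if (e.adjustment < gapThreshold && !text.empty() && text.back() != ' ') {
            text += ' ';
        }
    }
    return text;
}
```

Small adjustments are usually kerning between letters of the same word, which is why they are skipped here.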
    

    The other commands are used to determine when to introduce a line break. " and ' also introduce line breaks. Your best bet is to download the PDF spec from Adobe and look up the text processing section. It explains what each command does in more detail.
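
A possible heuristic for the line-break decision is sketched here. `startsNewLine` is a hypothetical helper: it treats T*, ' and " as always starting a new line, and Td/TD as doing so only when their last operand (the vertical offset) is non-zero. Tm is left out on purpose, because deciding whether it moved to a new line means comparing text matrices.

```cpp
#include <cassert>
#include <string>

// Returns true if the operator is assumed to start a new line of text.
// lastOperand is the final numeric operand (the vertical offset for Td/TD)
// and is ignored for every other operator.
bool startsNewLine(const std::string& op, double lastOperand = 0.0) {
    if (op == "T*" || op == "'" || op == "\"") return true;
    if (op == "Td" || op == "TD") return lastOperand != 0.0;
    return false;
}
```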

    I found it very helpful to write a small program which takes a PDF file and dumps out the command stream for each page.

    Note: If all you’re doing is extracting raw text with no positioning information, you don’t actually need to maintain a stack of var values. Every text-showing command takes its string or array as its last parameter, so you can simply assume that the last value in var contains the text for the current command.
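
The stack-free shortcut from the note could look roughly like this. `Item` is a hypothetical stand-in for what each loop iteration yields (a keyword or a value), and `shownStrings` only remembers the most recent value. It handles ", ' and Tj; TJ arrays are left out to keep the sketch short.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for one item produced by the tokenizer loop.
struct Item {
    bool isKeyword;    // true: an operator such as Tj; false: an operand
    bool isString;     // meaningful for operands only
    std::string text;  // operator name or operand contents
};

// Collects the strings shown by Tj, ' and " using only the last operand seen.
std::vector<std::string> shownStrings(const std::vector<Item>& items) {
    std::vector<std::string> out;
    const Item* last = nullptr;  // most recent operand; no stack is kept
    for (const Item& it : items) {
        if (!it.isKeyword) { last = &it; continue; }
        const bool shows = it.text == "Tj" || it.text == "'" || it.text == "\"";
        if (shows && last != nullptr && last->isString) out.push_back(last->text);
        last = nullptr;  // operands never carry over to the next operator
    }
    return out;
}
```

This works for " as well, because its string comes after the two spacing values.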
