The Archive Base


Editorial Team
Asked: May 17, 2026


Note: I am not interested in using a parsing library. This is for my own entertainment.

I’ve been experimenting with ripping text out of PDF files for a search gizmo, but I am unable to extract text from some PDF files.

Note that this is a much easier problem than straight up parsing; I don’t care if I inadvertently include some garbage in my output, nor do I really care if the formatting of the document is intact. I don’t even care if the words come out in order.

As a first step, I created a very simple PDF parser using the strategy found in this project. Basically, all it does is search PDF files for zlib streams, inflate them, and pull out any text it finds in parentheses. This fails to parse data stuck inside << >> blocks, but my understanding is that those are for hex-encoded blobs of data, which don’t seem to be in the test file that I am failing to parse… or at least I don’t see them.
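A minimal sketch of that strategy, for reference. The function name `rip_text` and both regular expressions are illustrative assumptions, not code from the project mentioned above; it ignores nested parentheses, escape sequences, and every filter other than Flate.

```python
import re
import zlib

# Anything between "stream" and "endstream" is treated as a candidate stream.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# Literal strings: "(...)" with backslash escapes allowed, nesting not handled.
LITERAL_RE = re.compile(rb"\((?:\\.|[^\\()])*\)")


def rip_text(pdf_bytes: bytes) -> list[str]:
    """Return every parenthesised string found in Flate-compressed streams."""
    found = []
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the trailing newline before "endstream".
            data = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not a zlib stream (image data, another filter, ...)
        for literal in LITERAL_RE.findall(data):
            # Strip the surrounding parentheses; escapes are left untouched.
            found.append(literal[1:-1].decode("latin-1"))
    return found
```

Calling `rip_text(open(path, "rb").read())` on a file with simple single-byte fonts yields readable fragments; on the failing test file it presumably yields little or only garbage.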

Similarly, iText.Net also fails, though PDFMiner and PDFBox succeed. However, the latter two projects have too many layers of indirection to be easily examined; I had trouble figuring out exactly what they were doing, in part because I don’t really use either language enough to be accustomed to debugging it in any significant manner.

My goal is to create a text ripper that grabs text out of a PDF file with as little understanding of the PDF format itself as possible (e.g. my test parser grabs text out of parentheses, but has no understanding of which portion of the PDF it is examining is the header).

1 Answer

  1. Editorial Team
    Added an answer on May 17, 2026 at 9:07 pm

    Extracting content from a PDF file can get a little complex. I do this as my daily job, and I think I can point you in the right direction.

    What you are trying to do (extracting the strings between parentheses) works only with simple single-byte encodings such as WinAnsi or MacRoman, used with Type1 or TrueType fonts. Unfortunately, these single-byte encodings do not support proper Unicode content. Your sample document uses Type0 (also known as CID) fonts, where each character is identified by a glyph index. These are non-standard, ad hoc encodings: the designer of the font may assign a glyph index to any character in an arbitrary way. Sometimes the producer of the PDF intentionally mangles the encoding.

    The way it works is that, starting with the catalog, you parse the page tree. Once you identify a page object, you parse its contents as well as its resources. The resources dictionary contains a list of the fonts used by the page. A CID font object normally references a ToUnicode stream (the entry is optional, so some files lack it), which is a cmap (character map) that establishes the relationship between the glyph indexes and their Unicode values. For example:

    <01> <0044>
    <02> <0061>
    <03> <0074>
    <04> <0020>
    

    This means glyph 01 is Unicode U+0044 (D), glyph 02 is U+0061 (a), and so on. You have to use this lookup table to translate glyph IDs back into Unicode.
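As a rough sketch, a lookup table like this can be built with a regular expression. It assumes the pairs sit between `beginbfchar` and `endbfchar` keywords, as they do in a complete ToUnicode cmap (the excerpt above omits them), and it skips `bfrange` sections entirely. The name `parse_bfchar` is made up for this example.

```python
import re

# Only look inside beginbfchar ... endbfchar, so that codespace ranges
# and bfrange entries are not mistaken for single-character mappings.
SECTION_RE = re.compile(r"beginbfchar(.*?)endbfchar", re.DOTALL)
PAIR_RE = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text: str) -> dict[int, str]:
    """Map each source code to its Unicode text."""
    table = {}
    for section in SECTION_RE.findall(cmap_text):
        for src, dst in PAIR_RE.findall(section):
            # The destination is UTF-16BE and may hold several code units.
            table[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return table


sample = """4 beginbfchar
<01> <0044>
<02> <0061>
<03> <0074>
<04> <0020>
endbfchar"""
print(parse_bfchar(sample))  # {1: 'D', 2: 'a', 3: 't', 4: ' '}
```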

    The page content itself has two important operators for you. Tf is the font selector, which matters because it identifies the font object. Each font has its own ToUnicode cmap, so you must use a different lookup table depending on the font.
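A small sketch of spotting font selections in an already decompressed content stream. The regular expression and the name `fonts_selected` are assumptions for illustration; a real tokenizer would also have to skip string contents, which this one does not.

```python
import re

# "/Name size Tf", for example "/F1 12 Tf".
TF_RE = re.compile(rb"/(\S+)\s+[-+]?(?:\d+\.?\d*|\.\d+)\s+Tf\b")


def fonts_selected(content: bytes) -> list[str]:
    """Return the font resource names in the order Tf selects them."""
    return [m.group(1).decode("latin-1") for m in TF_RE.finditer(content)]


print(fonts_selected(b"BT /F1 12 Tf <0001> Tj /F2 9.5 Tf (x) Tj ET"))
# ['F1', 'F2']
```

The names returned here (`F1`, `F2`) are keys in the page's font resources, which is how each one leads to its own ToUnicode cmap.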

    The other interesting operator is the text-showing operator (typically TJ or Tj). With Type0 (CID) fonts, the Tj operand doesn’t contain human-readable text but a sequence of glyph IDs that you are supposed to map into Unicode with the help of the cmap described above. Often the Tj uses a hex string, such as <000100a50056> Tj, instead of the more typical (Hello, World) Tj that you are familiar with. Either way, the string is not human-readable and cannot be extracted without fully parsing the page, including all of its font resources, especially the ToUnicode cmap. The cmap itself is a PostScript object, but you only care about the hex portions.
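Putting the two operators together, here is a hedged sketch that decodes hex strings using whichever font Tf selected last. It assumes fixed-width codes (two bytes by default, matching the hex example above), assumes the per-font tables were already built from each ToUnicode cmap, and handles hex strings only, not literal strings. The names `decode_hex_strings`, `tables`, and `code_bytes` are hypothetical.

```python
import re

# Either a font selection ("/F1 12 Tf") or a hex string ("<0001...>").
TOKEN_RE = re.compile(
    rb"/(\S+)\s+[-+]?(?:\d+\.?\d*|\.\d+)\s+Tf\b|<([0-9A-Fa-f\s]*)>"
)


def decode_hex_strings(content: bytes,
                       tables: dict[str, dict[int, str]],
                       code_bytes: int = 2) -> str:
    """Translate hex strings to text via the table of the current font."""
    out = []
    current = {}  # lookup table of the font selected by the last Tf
    step = code_bytes * 2  # hex digits per code
    for m in TOKEN_RE.finditer(content):
        if m.group(1) is not None:
            current = tables.get(m.group(1).decode("latin-1"), {})
            continue
        digits = re.sub(rb"\s", b"", m.group(2)).decode("ascii")
        if len(digits) % 2:
            digits += "0"  # an odd-length hex string is padded with 0
        for i in range(0, len(digits) - step + 1, step):
            code = int(digits[i:i + step], 16)
            # Unmapped codes become U+FFFD rather than being dropped.
            out.append(current.get(code, "\ufffd"))
    return "".join(out)


tables = {"F1": {1: "D", 2: "a", 3: "t", 4: " "}}
print(decode_hex_strings(b"BT /F1 12 Tf <0001000200030002> Tj ET", tables))
# Data
```

Because it collects every hex string, the same loop covers TJ arrays such as `[<00010002> -15 <00030002>] TJ`; the kerning numbers between the strings are simply ignored.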

    Of course, I have oversimplified the process, because there are dozens of different standard encodings and custom encodings (Differences arrays or ToUnicode), and we haven’t even touched Arabic, Hindi, vertical Japanese fonts, Type3 fonts, etc. Sometimes the text cannot be extracted at all, because it’s intentionally mangled.


© 2021 The Archive Base. All Rights Reserved