The Archive Base


Editorial Team
Asked: May 10, 2026

We get a large amount of data from our clients in pdf files in


We get a large amount of data from our clients in PDF files in varying formats [layout-wise]. These files are typically report output and are typically properly annotated [they don't usually need OCR], but they are not formatted well enough for simply copying several hundred pages of text out of Acrobat to work.

The best approach I've found so far is to write a script that parses the output of the command-line pdftoipe utility, which converts PDF files for a program called Ipe. The output is nearly valid XML: the comments are invalid and many characters are escaped in varying ways (é becomes [[[e9]]]é, $ becomes \$, % becomes \%…). It gives me text elements with their positions on each page [see sample below]. That works well enough for reports where the same values are in the same place on every page I care about, but it would require extra scripting effort to import matrix [cross-tab] PDF files. pdftoipe is not at all intended for this, and at best it can be compiled manually for Windows using Cygwin.

Are there libraries that make this easy from some scripting language I can tolerate? A graphical tool would be awesome too. And a pony.

pdftoipe output of this sample looks like this:

    <ipe creator='pdftoipe 2006/10/09'><info media='0 0 612 792'/>
    <-- Page: 1 1 -->
    <page gridsize='8'>
    <path fill='1 1 1' fillrule='wind'>
    64.8 144 m
    486 144 l
    486 727.2 l
    64.8 727.2 l
    64.8 144 l
    h
    </path>
    <path fill='1 1 1' fillrule='wind'>
    64.8 144 m
    486 144 l
    486 727.2 l
    64.8 727.2 l
    64.8 144 l
    h
    </path>
    <path fill='1 1 1' fillrule='wind'>
    64.8 144 m
    486 144 l
    486 727.2 l
    64.8 727.2 l
    64.8 144 l
    h
    </path>
    <text stroke='1 0 0' pos='0 0' size='18' transformable='yes' matrix='1 0 0 1 181.8 707.88'>This is a sample PDF fil</text>
    <text stroke='1 0 0' pos='0 0' size='18' transformable='yes' matrix='1 0 0 1 356.28 707.88'>e.</text>
    <text stroke='1 0 0' pos='0 0' size='18' transformable='yes' matrix='1 0 0 1 368.76 707.88'> </text>
    <text stroke='0 0 0' pos='0 0' size='12.6' transformable='yes' matrix='1 0 0 1 67.32 692.4'> </text>
    <text stroke='0 0 0' pos='0 0' size='12.6' transformable='yes' matrix='1 0 0 1 67.32 677.88'> </text>
    <text stroke='0 0 0' pos='0 0' size='12.6' transformable='yes' matrix='1 0 0 1 67.32 663.36'> </text>
    <text stroke='0 0 0' pos='0 0' size='12.6' transformable='yes' matrix='1 0 0 1 67.32 648.84'> </text>
    <text stroke='0 0 0' pos='0 0' size='12.6' transformable='yes' matrix='1 0 0 1 67.32 634.32'> </text>
    <text stroke='0 0 0' pos='0 0' size='12.6' transformable='yes' matrix='1 0 0 1 67.32 619.8'> </text>
    <text stroke='0 0 0' pos='0 0' size='12.6' transformable='yes' matrix='1 0 0 1 67.32 605.28'> </text>
    <text stroke='0 0 0' pos='0 0' size='12.6' transformable='yes' matrix='1 0 0 1 67.32 590.76'> </text>
    <text stroke='0 0 0' pos='0 0' size='12.6' transformable='yes' matrix='1 0 0 1 67.32 576.24'> </text>
    <text stroke='0 0 0' pos='0 0' size='12.6' transformable='yes' matrix='1 0 0 1 67.32 561.72'> </text>
    <text stroke='0 0 0' pos='0 0' size='12.6' transformable='yes' matrix='1 0 0 1 67.32 547.2'> </text>
    <text stroke='0 0 0' pos='0 0' size='12.6' transformable='yes' matrix='1 0 0 1 67.32 532.68'> </text>
    <text stroke='0 0 0' pos='0 0' size='12.6' transformable='yes' matrix='1 0 0 1 67.32 518.16'> </text>
    <text stroke='0 0 0' pos='0 0' size='12.6' transformable='yes' matrix='1 0 0 1 67.32 503.64'> </text>
    <text stroke='0 0 0' pos='0 0' size='12.6' transformable='yes' matrix='1 0 0 1 67.32 489.12'> </text>
    <text stroke='0 0 0' pos='0 0' size='12.6' transformable='yes' matrix='1 0 0 1 67.32 474.6'> </text>
    <text stroke='0 0 1' pos='0 0' size='16.2' transformable='yes' matrix='1 0 0 1 67.32 456.24'>If you can read this</text>
    <text stroke='0 0 1' pos='0 0' size='16.2' transformable='yes' matrix='1 0 0 1 214.92 456.24'>,</text>
    <text stroke='0 0 1' pos='0 0' size='16.2' transformable='yes' matrix='1 0 0 1 219.48 456.24'> you already have A</text>
    <text stroke='0 0 1' pos='0 0' size='16.2' transformable='yes' matrix='1 0 0 1 370.8 456.24'>dobe Acrobat </text>
    <text stroke='0 0 1' pos='0 0' size='16.2' transformable='yes' matrix='1 0 0 1 67.32 437.64'>Reader i</text>
    <text stroke='0 0 1' pos='0 0' size='16.2' transformable='yes' matrix='1 0 0 1 131.28 437.64'>n</text>
    <text stroke='0 0 1' pos='0 0' size='16.2' transformable='yes' matrix='1 0 0 1 141.12 437.64'>stalled on your computer.</text>
    <text stroke='0 0 0' pos='0 0' size='16.2' transformable='yes' matrix='1 0 0 1 337.92 437.64'> </text>
    <text stroke='0 0.502 0' pos='0 0' size='12.6' transformable='yes' matrix='1 0 0 1 342.48 437.64'> </text>
    <image width='800' height='600' rect='-92.04 800.64 374.4 449.76' ColorSpace='DeviceRGB' BitsPerComponent='8' Filter='DCTDecode' length='369925'>
    feedcafebabe...
    </image>
    </page>
    </ipe>
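
For anyone wanting to try the same approach, here is a rough Python sketch of the kind of script described above. Because the output is only nearly valid XML, it uses a regular expression rather than an XML parser. It takes the last two numbers of each `matrix` attribute as the x/y position, which is an assumption based on the sample rather than on pdftoipe documentation. The unescaping rules are guessed from the three examples in the question, and the function names (`unescape`, `extract_text`, `group_rows`) are made up for illustration.

```python
import re
from collections import defaultdict

# One <text ... matrix='a b c d x y'>content</text> element.
TEXT_RE = re.compile(
    r"<text\b[^>]*?\bmatrix='([^']*)'[^>]*>(.*?)</text>", re.DOTALL
)


def unescape(s):
    """Undo the escapes mentioned in the question (assumed forms only)."""
    # "[[[e9]]]" appears before the literal character, so drop the marker.
    s = re.sub(r"\[\[\[[0-9a-fA-F]+\]\]\]", "", s)
    # "\$" and "\%" become "$" and "%".
    s = re.sub(r"\\([$%])", r"\1", s)
    return s


def extract_text(ipe_source):
    """Return a list of (x, y, text) tuples, one per <text> element."""
    items = []
    for matrix, content in TEXT_RE.findall(ipe_source):
        parts = matrix.split()
        # Assumption: the last two matrix values are the translation (x, y).
        x, y = float(parts[4]), float(parts[5])
        items.append((x, y, unescape(content)))
    return items


def group_rows(items, tolerance=1.0):
    """Join fragments that share (roughly) the same y into one line each."""
    rows = defaultdict(list)
    for x, y, text in items:
        rows[round(y / tolerance)].append((x, text))
    # PDF y coordinates grow upward, so the largest y is the top of the page.
    ordered = sorted(rows.items(), reverse=True)
    return ["".join(t for _, t in sorted(cells)).strip() for _, cells in ordered]
```

Calling `group_rows(extract_text(source))` on one page of output gives one string per text line, top to bottom. For cross-tab reports the same (x, text) pairs could be bucketed by x as well to recover columns. The rounding used for bucketing is crude and will split a row whose y values straddle a bucket boundary.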
1 Answer

  1. Added an answer on May 10, 2026 at 1:34 pm

    We use Xpdf in one of our applications. It's a C++ library that is primarily used for PDF rendering, although it does have a text extractor that could be useful for this project.
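
If you would rather not link against the C++ code, one possible route is the `pdftotext` command-line tool that ships with Xpdf, driven from a script. The sketch below is an illustration and not the setup described in this answer. It assumes `pdftotext` is on the PATH, and the helper names (`build_pdftotext_cmd`, `extract_pages`, `split_columns`) are invented here. The `-layout` option asks the tool to keep the physical layout of the page, which tends to suit report output.

```python
import re
import subprocess


def build_pdftotext_cmd(pdf_path, first_page=None, last_page=None):
    """Build a pdftotext command that writes layout-preserving text to stdout."""
    cmd = ["pdftotext", "-layout", "-enc", "UTF-8"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    cmd += [pdf_path, "-"]  # "-" as the output file means stdout
    return cmd


def extract_pages(pdf_path, first_page=None, last_page=None):
    """Run pdftotext and return a list with one string per page."""
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, first_page, last_page),
        capture_output=True,
        check=True,
    )
    text = result.stdout.decode("utf-8", errors="replace")
    # pdftotext separates pages with a form feed character.
    return text.split("\f")


def split_columns(line):
    """Split one layout-preserved line on runs of two or more spaces."""
    return [cell for cell in re.split(r" {2,}", line.strip()) if cell]
```

Splitting on runs of spaces is a heuristic: it breaks down when a cell is empty or when two columns sit only one space apart, so fixed column offsets may be more reliable for reports whose layout never changes.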



© 2021 The Archive Base. All Rights Reserved