The Archive Base

Editorial Team
Asked: June 11, 2026

I have a PDF file containing some tabular data. http://dl.dropbox.com/u/44235928/sample_rotate-0.pdf I have to extract


I have a PDF file containing some tabular data.

http://dl.dropbox.com/u/44235928/sample_rotate-0.pdf

I have to extract the tabular data from it. I have tried the following without success:

  1. Selected the text and pasted it into Notepad and an Excel sheet. (I get junk characters.)
  2. Used "Save as Text" in Acrobat Reader. It also gives junk characters instead of the actual text.
  3. Tried the Apache PDFBox command-line utility to extract text from the PDF. It also gives junk characters instead of the real text.
  4. Finally, I am trying an OCR solution. I convert the PDF file into .tif images using ImageMagick and have those images processed by Tesseract OCR.

The OCR solution is not very accurate, though (about 80% of the words matched).

I tried changing the density and geometry of the image created from the PDF to get better results from Tesseract:

convert -rotate 90 -geometry 10000 -depth 8 -density 800 sample.pdf img_800_10000.tif;
tesseract img_800_10000.tif img_800_10000.tif nobatch letters;

I am not sure what kind of image (density, geometry, monochrome, sharpened boundaries, etc.) is best suited for OCR.

Please suggest the best possible parameters (density, geometry, depth, etc.) for generating images from a PDF file so that Tesseract's accuracy increases.

I am open to other (non-OCR) solutions as well.

1 Answer

  1. Editorial Team
    Added an answer on June 11, 2026 at 2:15 am

    In this case I recommend not using ImageMagick for the PDF -> TIFF conversion. Use Ghostscript instead, for two reasons:

    1. Using Ghostscript directly will give you more control over individual parameters of the conversion.

    2. ImageMagick cannot do that particular conversion itself. It calls Ghostscript as its ‘delegate’ anyway, but does not give you the same fine-grained control that your own Ghostscript command does.

    Most of the text in the table of your sample PDF is extremely small (I would guess only 4 or 5 pt high). This makes successful OCR rather difficult unless you increase the resolution considerably.

    Ghostscript uses -r72 by default for image format output (such as TIFF). Tesseract works best with r=300 or r=400, but only for font sizes of 10-12 pt or higher. Therefore, to compensate for the small text size, you should have Ghostscript use a resolution of at least 1200 DPI when it renders the PDF to the image.
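
One rough way to see where a figure like 1200 DPI comes from is to scale the resolution in inverse proportion to the text height. This is only a rule-of-thumb sketch: the linear scaling is my assumption, not a documented Tesseract rule, and the helper name `ocr_dpi` and its defaults (400 DPI for 12 pt text) are made up for illustration.

```shell
#!/bin/sh
# ocr_dpi TEXT_PT [REF_PT] [REF_DPI]
# Prints the resolution at which TEXT_PT-high text covers about as many
# pixels as REF_PT-high text does at REF_DPI.
# Assumption: the required resolution scales inversely with text height.
ocr_dpi() {
  text_pt=$1          # estimated text height in the PDF, in points
  ref_pt=${2:-12}     # text height the reference resolution is good for
  ref_dpi=${3:-400}   # reference resolution for that text height
  echo $(( ref_dpi * ref_pt / text_pt ))   # integer arithmetic, rounds down
}
```

For example, `ocr_dpi 4` prints 1200 and `ocr_dpi 5` prints 960, which is in line with the "at least 1200 DPI" suggestion for text that is about 4 pt high.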

    Also, you’ll have to rotate the image so the text displays in the normal reading direction (not bottom -> top).

    This is the command which I would try first:

    gs                              \
      -o sample.tif                 \
      -sDEVICE=tiffg4               \
      -r1200                        \
      -dAutoRotatePages=/PageByPage \
       sample_rotate-0.pdf
    

    You may need to play with variations of the -r1200 parameter (higher or lower) for best results.
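
To try several resolutions in one go, something like the following sketch may help. It only prints the Ghostscript and Tesseract commands for each resolution, so you can inspect them before running anything. The function name `sweep_commands` and the output names `page_r<res>.tif` are hypothetical, the Ghostscript flags simply mirror the command above, and the sketch assumes the PDF file name contains no spaces.

```shell
#!/bin/sh
# sweep_commands PDF RES...
# Prints one gs command and one tesseract command per resolution.
# Nothing is executed here; pipe the output to sh to actually run it.
sweep_commands() {
  pdf=$1
  shift
  for res in "$@"; do
    # Render the PDF to a bilevel TIFF at the given resolution.
    echo "gs -o page_r${res}.tif -sDEVICE=tiffg4 -r${res} -dAutoRotatePages=/PageByPage ${pdf}"
    # OCR that TIFF; tesseract writes its text to page_r<res>.txt.
    echo "tesseract page_r${res}.tif page_r${res}"
  done
}
```

Running `sweep_commands sample_rotate-0.pdf 900 1200 1500 | sh` would then produce one .txt file per resolution, which you can compare against a few known cells of the table to pick the best setting.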

