The Archive Base

Editorial Team
Asked: May 16, 2026 (2026-05-16T17:06:34+00:00)


I have several low-quality PDFs. I would like to use OCR (Ocropus, to be precise) to get text from them. To do so, I first use ImageMagick, a command-line tool that can convert PDFs to images, to transform these PDFs into JPG or PNG.

However, ImageMagick produces very low-quality images, and Ocropus hardly recognizes anything. I would like to learn the best parameters for handling low-quality PDFs, so that the OCR gets images of the best possible quality.

I have found this page, but I do not know where to start.

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Added an answer on May 16, 2026 at 5:06 pm (2026-05-16T17:06:35+00:00)

    You can learn about the detailed settings of ImageMagick’s “delegates” (external programs IM uses, such as Ghostscript) by typing

    convert -list delegate
    

    (On my system that’s a list of 32 different commands.) Now to see which commands are used to convert to PNG, use this:

    convert -list delegate | findstr /i png
    

    Ok, this was for Windows. You didn’t say which OS you use. [*] If you are on Linux, try this:

    convert -list delegate | grep -i png
    

    You’ll discover that IM produces PNG only from PS or EPS input. So how does IM get (E)PS from your PDF? Easy:

    convert -list delegate | findstr /i PDF
    convert -list delegate | grep -i PDF
    

    Ah! It uses Ghostscript to make a PDF => PS conversion, then uses Ghostscript again to make a PS => PNG conversion. Works, but isn’t the most efficient way if you know that Ghostscript can do PDF => PNG in one go. And faster. And in much better quality.
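The delegate check above can be wrapped in a small helper. This is only a sketch: the function names `im_cmd` and `list_delegates` are made up for illustration, and it assumes that newer ImageMagick 7 installs expose the `magick` command while older ones expose `convert`.

```shell
# Hypothetical helpers (names are illustrative, not part of ImageMagick).

# Print the name of whichever ImageMagick front end is available.
im_cmd() {
  if command -v magick >/dev/null 2>&1; then
    echo magick
  elif command -v convert >/dev/null 2>&1; then
    echo convert
  else
    return 1
  fi
}

# List the delegate entries that mention a given format, case-insensitively.
list_delegates() {
  fmt=$1
  im=$(im_cmd) || { echo "ImageMagick not found" >&2; return 1; }
  "$im" -list delegate | grep -i -- "$fmt"
}
```

Calling `list_delegates png` and then `list_delegates pdf` should show the same lines as the grep commands above, whichever of the two front ends is installed.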

    About IM’s handling of PDF conversion to images via the Ghostscript delegate you should know two things first and foremost:

    1. By default, if you don’t give an extra parameter, Ghostscript will output images with a 72 dpi resolution. That’s why Karl’s answer suggested adding -density 600, which tells Ghostscript to use a 600 dpi resolution for its image output.
    2. IM’s detour of calling Ghostscript twice, to convert first PDF => PS and then PS => PNG, is a real blunder, because you never gain quality in the first step, hardly ever keep it, and very often lose some. Reasons:
      • PDF can handle transparencies, which PostScript can not.
      • PDF can embed TrueType fonts natively, which PostScript can not, and so on.
        (Conversion in the direction PS => PDF is not that critical.)
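To get a feel for what the resolution setting in point 1 means for OCR, here is a rough arithmetic sketch. It relies on the fact that PDF page sizes are measured in points (1 point = 1/72 inch) and assumes an A4 page of roughly 595 x 842 points. The function name `px_at_dpi` is made up for this example, and the pixel counts Ghostscript actually produces may differ by a pixel.

```shell
# Convert a length in PDF points (1 pt = 1/72 inch) to pixels at a given dpi,
# rounded to the nearest integer. Illustrative helper, not a Ghostscript tool.
px_at_dpi() {
  points=$1
  dpi=$2
  echo $(( (points * dpi + 36) / 72 ))
}

# An A4 page is roughly 595 x 842 points (assumed page size for this example).
for dpi in 72 300 600; do
  echo "$dpi dpi: $(px_at_dpi 595 "$dpi") x $(px_at_dpi 842 "$dpi") px"
done
# prints:
#   72 dpi: 595 x 842 px
#   300 dpi: 2479 x 3508 px
#   600 dpi: 4958 x 7017 px
```

At 72 dpi a whole page of text has to fit into about 595 x 842 pixels, which leaves very few pixels per character for the OCR engine to work with.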

    That’s why I’d suggest you convert your PDFs in one go to PNG (or JPEG) using Ghostscript directly. And use the most recent version of Ghostscript (8.71 at the time of writing; 9.01 is due to be released soon)! Here are example commands:

    gswin32c.exe ^
      -sDEVICE=pngalpha ^
      -o output/page_%03d.png ^
      -r600 ^
      d:/path/to/your/input.pdf
    

    (This is the command line for Windows. On Linux, use gs instead of gswin32c.exe, and \ instead of ^.) This command expects to find an output subdirectory, where it will store a separate file for each PDF page. To produce JPEGs of good quality, try

    gs \
      -sDEVICE=jpeg \
      -o output/page_%03d.jpeg \
      -r600 \
      -dJPEGQ=95 \
      /path/to/your/input.pdf
    

    (Linux command version.) This direct conversion avoids the intermediate PostScript format, which may have lost the TrueType font and transparency information that was in the original PDF file.
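Since the question mentions several PDFs, the PNG command above could be wrapped in a loop along these lines. This is a sketch under assumptions, not a definitive script: the function names `pdf_to_png` and `convert_all`, the `GS_BIN` and `DRY_RUN` variables, and the one-subdirectory-per-PDF layout are all invented for illustration. Only the Ghostscript options themselves (`-sDEVICE=pngalpha`, `-o`, `-r`) come from the commands above.

```shell
#!/bin/sh
# Hypothetical batch helper. GS_BIN and DRY_RUN are illustrative variable
# names for this sketch, not Ghostscript features.
GS_BIN=${GS_BIN:-gs}

# Render every page of one PDF to PNG files in <outroot>/<pdf name>/.
pdf_to_png() {
  pdf=$1
  outroot=${2:-output}
  dpi=${3:-600}
  name=$(basename "$pdf" .pdf)
  outdir=$outroot/$name
  # Create the output directory up front, since the command expects it.
  mkdir -p "$outdir" || return 1
  if [ -n "${DRY_RUN:-}" ]; then
    # Only show the command that would be run.
    printf '%s\n' "$GS_BIN -sDEVICE=pngalpha -o $outdir/page_%03d.png -r$dpi $pdf"
  else
    "$GS_BIN" -sDEVICE=pngalpha -o "$outdir/page_%03d.png" "-r$dpi" "$pdf"
  fi
}

# Convert all *.pdf files found directly inside a directory.
convert_all() {
  indir=$1
  outroot=${2:-output}
  dpi=${3:-600}
  for pdf in "$indir"/*.pdf; do
    [ -e "$pdf" ] || continue   # skip the unexpanded pattern if nothing matches
    pdf_to_png "$pdf" "$outroot" "$dpi" || return 1
  done
}
```

For example, after setting `DRY_RUN=1`, running `convert_all /path/to/pdfs output 600` prints the Ghostscript commands without running them; with `DRY_RUN` unset it performs the conversion. Giving each PDF its own subdirectory keeps the page_001.png files of different documents from overwriting each other.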


    [*] D’oh! I missed your “linux” tag at first…
