I have a multi-page PDF file containing information I need to parse. Each record's text and picture are confined to their own page. I need to extract the text and the image from each page of the PDF.
I’m using CentOS and PHP.
My attempt:
I originally tried using a combination of pdftotext and ImageMagick. I converted the PDF into images, which did separate the pages into one image each. Unfortunately, the quality of the picture on each page came out very poor.
My goal:
I need to split the PDF into multiple PDFs, one per page. Then, I need to extract the image from that page with the best quality possible.
Thanks.
ImageMagick is not the right tool for this task.
When you need to extract images from a PDF at their original size (which is the best quality you can get, since any other resolution means scaling the original down or up), you should use
pdfimages
http://www.foolabs.com/xpdf/download.html
(static binaries are available if you cannot compile from source)
Syntax:
pdfimages [options] PDF-file image-root
The extracted images are written as .ppm files (or .pbm for monochrome images) unless you add the -j switch, which saves images that are stored as JPEG inside the PDF as .jpg files.
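As a rough sketch of how this could cover the per-page requirement from the question without splitting the PDF first: pdfimages and pdftotext both accept -f (first page) and -l (last page), so each page can be processed on its own. The function names, the output directory layout, and the page-N naming below are illustrative assumptions, not part of the tools. The sketch also assumes pdfinfo, pdftotext and pdfimages are all installed and on the PATH.

```shell
#!/bin/sh
# Sketch: extract text and embedded images one page at a time.
# Assumes pdfinfo, pdftotext and pdfimages are available on PATH.
# Function names and output naming are hypothetical choices for illustration.

# Print the page count reported by pdfinfo (the "Pages:" line).
pdf_page_count() {
    pdfinfo "$1" | awk '/^Pages:/ {print $2}'
}

# extract_pages INPUT.pdf OUTDIR
# Writes OUTDIR/page-N.txt plus image files whose names start with
# OUTDIR/page-N for every page N of the input.
extract_pages() {
    pdf=$1
    outdir=$2
    mkdir -p "$outdir" || return 1

    pages=$(pdf_page_count "$pdf")
    [ -n "$pages" ] || return 1

    page=1
    while [ "$page" -le "$pages" ]; do
        # -f and -l restrict each tool to a single page.
        pdftotext -f "$page" -l "$page" "$pdf" "$outdir/page-$page.txt"
        # -j keeps JPEG-encoded images as .jpg; others come out as .ppm/.pbm.
        pdfimages -j -f "$page" -l "$page" "$pdf" "$outdir/page-$page"
        page=$((page + 1))
    done
}

# Example call (file names are placeholders):
#   extract_pages input.pdf out
```

From PHP, a script like this could be run with exec() or shell_exec(), remembering to pass the file names through escapeshellarg().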