I am building a web application with Perl. Users send me an XML file that contains, among other things, references to a number of PDF documents. I use XSLT to transform the XML to XHTML, and then use PrinceXML to create a PDF document from the XHTML. This PDF reserves empty pages with headers and footers for the attachments that will be included.
Once I have the PDF, I use the PDF::API2 Perl module to open the PDF documents referenced in the XML one by one, scale and rotate the pages if required, and then include them in the PDF document that I created.
My problem is that many of the PDFs submitted by the users are broken in some way: they do not conform to Adobe’s PDF specifications, and PDF::API2 does not know how to manipulate them. The PDF::API2 documentation suggests using pdftk to repair broken PDFs, but this often takes a long time and in many cases does not succeed.
What is the best way to repair such broken PDFs?
What you advocate here is sometimes called ‘re-frying the PDFs’: conversion to PostScript and back to PDF.
However, while this can fix some problems that are not easily fixable with other methods, you should also be aware of the problems and shortcomings that regularly lie along this path:
PostScript’s graphic capabilities are more limited than PDF’s. PDF has added support for real transparency, more color spaces, ICC color profiles and more font types, features which aren’t available in PostScript. (In fact, the need to add such features to the original PostScript graphic model was one of the incentives for Adobe to start developing the PDF file format at all!)
So going from PDF to PostScript will tend to lose quality, which you will not get back when converting back to PDF.
However, there is another alternative you could try that avoids the re-frying detour:
Convert PDF -> PDF directly with the help of Ghostscript.
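A minimal sketch of such a direct conversion, using Ghostscript's `pdfwrite` device, might look like the following. The file names, the `repair_pdf` wrapper and the `GS` override variable are placeholders chosen for illustration, and `-dPDFSETTINGS=/prepress` is only one reasonable preset, not a required one:

```shell
#!/bin/sh
# Sketch: rewrite a (possibly damaged) PDF as a new PDF with Ghostscript.
# Assumption: Ghostscript is installed as "gs"; set GS to use another binary.
repair_pdf() {
    # $1: input PDF (possibly damaged)
    # $2: path where the rewritten PDF is stored
    # -o sets the output file and makes Ghostscript run non-interactively.
    # -sDEVICE=pdfwrite selects PDF output instead of rendering to an image.
    "${GS:-gs}" -o "$2" -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress "$1"
}
```

Calling `repair_pdf corrupted.pdf repaired.pdf` would then write the rewritten file to `repaired.pdf`. Whether the result is usable still depends on how badly the input is damaged.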
Please use the most recent Ghostscript version that’s available for this.
Ghostscript has a lot of options which you can use to control individual aspects of the PDF repair process. Without knowing your specific problems, I cannot be more specific here.
In the past 10 years I have encountered only a few PDF problems that Ghostscript couldn’t repair but re-frying via Acroread could. On the other hand, I have seen many more cases where Acroread’s re-frying didn’t succeed while Ghostscript’s PDF -> PDF conversion did.