I’m looking for a utility or library that extracts text from PDFs and writes it out as plain text while keeping as much of the original layout as possible (e.g. tables, columns, etc.).
We’re currently using pdftotext but I was wondering if there’s anything better. It has to be a command-line tool or a library we can link into our app.
Is pdftotext as good as it gets, or is there something better?
For the benefit of others with the same problem: we ended up staying with pdftotext despite its drawbacks (like sometimes producing garbage output when font subsets are used). See also: http://www.glyphandcog.com/textext.html
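For anyone taking the same route, here is a minimal sketch of calling pdftotext from an application with its `-layout` option, which tries to keep the physical layout (columns, tables) of the page. It assumes the `pdftotext` binary is on the PATH; the helper names `build_pdftotext_command` and `pdf_to_text` are made up for illustration and are not part of any library.

```python
import subprocess
import sys


def build_pdftotext_command(pdf_path, first_page=None, last_page=None):
    """Build the pdftotext argument list (hypothetical helper, not a library API)."""
    # -layout asks pdftotext to preserve the original physical layout;
    # -enc UTF-8 sets the output text encoding.
    cmd = ["pdftotext", "-layout", "-enc", "UTF-8"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]   # last page to convert
    # Using "-" as the output file sends the text to stdout.
    cmd += [str(pdf_path), "-"]
    return cmd


def pdf_to_text(pdf_path, first_page=None, last_page=None):
    """Run pdftotext and return the extracted text (assumes pdftotext is installed)."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, first_page, last_page),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__" and len(sys.argv) > 1:
    print(pdf_to_text(sys.argv[1]))
```

This does not help with the font-subset problem mentioned above; it only shows the layout-preserving invocation.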