I am looking for a library (ideally available in Java or PHP) to extract text from a PDF. There is a lot of software available, including:
- 3-Heights™ PDF Extract: http://www.pdf-tools.com/pdf/pdf-extract-content-metadata-text.aspx
- PDFlib TET – Text Extraction Toolkit: http://www.pdflib.com/products/tet/
Which tools would you choose? What do you think of them?
Thank you very much for your kind help!
My favourite is iText (Java), but extracting text from a PDF can be fraught with difficulties, because the text in a PDF is not always stored in the order in which it appears on the page.
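
To illustrate the ordering problem, here is a minimal sketch in plain Java. It is not iText's API: the `Chunk` class, the `toReadingOrder` method and the tolerance value are all hypothetical names I made up for the example. It assumes you already have text fragments with page coordinates (y growing upwards, as in PDF) and shows one common way to rebuild reading order: group fragments into lines by their y position, then sort each line from left to right.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ReadingOrder {

    /** Hypothetical text fragment with the position of its baseline start. */
    static final class Chunk {
        final String text;
        final float x;
        final float y;

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Rebuilds reading order from fragments given in arbitrary (storage) order.
     * Fragments whose y differs from the first fragment of the current line by
     * at most lineTolerance are treated as belonging to that line.
     */
    static String toReadingOrder(List<Chunk> chunks, float lineTolerance) {
        // Top of the page first: PDF y coordinates grow upwards.
        List<Chunk> byY = new ArrayList<>(chunks);
        byY.sort(Comparator.comparingDouble((Chunk c) -> c.y).reversed());

        // Group fragments into lines.
        List<List<Chunk>> lines = new ArrayList<>();
        for (Chunk c : byY) {
            List<Chunk> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(last.get(0).y - c.y) <= lineTolerance) {
                last.add(c);
            } else {
                List<Chunk> line = new ArrayList<>();
                line.add(c);
                lines.add(line);
            }
        }

        // Within each line, order fragments from left to right.
        StringBuilder out = new StringBuilder();
        for (List<Chunk> line : lines) {
            line.sort(Comparator.comparingDouble((Chunk c) -> c.x));
            if (out.length() > 0) {
                out.append('\n');
            }
            for (int i = 0; i < line.size(); i++) {
                if (i > 0) {
                    out.append(' ');
                }
                out.append(line.get(i).text);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Fragments in the order they might be stored in the file,
        // which is not the order a reader sees on the page.
        List<Chunk> stored = Arrays.asList(
                new Chunk("world", 100f, 700f),
                new Chunk("Second", 50f, 680f),
                new Chunk("Hello", 50f, 700f),
                new Chunk("line", 100f, 680.5f));

        System.out.println(toReadingOrder(stored, 2f));
    }
}
```

This only handles simple single-column pages. Multi-column layouts, rotated text and tables need more work, which is where a dedicated extraction library earns its keep.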