I am trying to get text data from a pdf using pdfminer. I am able to extract this data to a .txt file successfully with the pdfminer command line tool pdf2txt.py. I currently do this and then use a python script to clean up the .txt file. I would like to incorporate the pdf extract process into the script and save myself a step.
I thought I was on to something when I found this link, but I didn’t have success with any of the solutions. Perhaps the function listed there needs to be updated again because I am using a newer version of pdfminer.
I also tried the function shown here, but it also did not work.
Another approach I tried was to call the script within a script using os.system. This was also unsuccessful.
I am using Python version 2.7.1 and pdfminer version 20110227.
Here is a cleaned up version I finally produced that worked for me. The following just simply returns the string in a PDF, given its filename. I hope this saves someone time.
This solution was valid until API changes in November 2013.