I am trying to get text data from a PDF using pdfminer

Question

Asked: May 27, 2026 (2026-05-27T05:15:46+00:00)

I am trying to get text data from a PDF using pdfminer. I can extract this data to a .txt file successfully with the pdfminer command-line tool pdf2txt.py. I currently do this and then use a Python script to clean up the .txt file. I would like to incorporate the PDF extraction step into the script and save myself a step.

I thought I was on to something when I found this link, but I didn’t have success with any of the solutions. Perhaps the function listed there needs to be updated again because I am using a newer version of pdfminer.

I also tried the function shown here, but it also did not work.

Another approach I tried was to call the script within a script using os.system. This was also unsuccessful.

I am using Python version 2.7.1 and pdfminer version 20110227.

1 Answer

Editorial Team · Answer 1 · 2026-05-27T05:15:47+00:00

Here is a cleaned-up version that finally worked for me. It simply returns the text of a PDF as a string, given its filename. I hope this saves someone time.

from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from cStringIO import StringIO

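def convert_pdf(path):
    """Return the text of the PDF at `path` as a UTF-8 encoded string."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # in-memory buffer that collects the extracted text
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)

    # Open in binary mode; the with block closes the file for us.
    with open(path, 'rb') as fp:
        process_pdf(rsrcmgr, device, fp)
    device.close()

    # Named `text` rather than `str` to avoid shadowing the built-in.
    text = retstr.getvalue()
    retstr.close()
    return text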

This solution was valid until the pdfminer API changes in November 2013.
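
If process_pdf is not available in your pdfminer version, one fallback is to run the command-line tool that the question reports already works, pdf2txt.py, from inside the script and capture its output. The following is only a minimal sketch under some assumptions: the function name pdf_to_text and its script parameter are hypothetical, script must point at wherever pdf2txt.py lives on your machine, and the output is assumed to be UTF-8 text written to standard output (which is what pdf2txt.py does when no output file is given).

```python
import subprocess
import sys


def pdf_to_text(pdf_path, script="pdf2txt.py"):
    """Run pdf2txt.py on pdf_path and return what it prints, as text.

    `script` is an assumed location for pdf2txt.py; pass the full path
    if the file is not in the current working directory.
    """
    # Use the current interpreter to run the script, so it does not need
    # to be executable or on PATH. check_output raises CalledProcessError
    # if the script exits with a non-zero status.
    raw = subprocess.check_output([sys.executable, script, pdf_path])
    # Assumption: the tool writes UTF-8 encoded text to stdout.
    return raw.decode("utf-8")
```

Unlike os.system, subprocess.check_output hands the text back to the calling script directly, so no intermediate .txt file is needed, and a failure in the tool surfaces as an exception rather than a silently ignored exit code.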
