So far here is the code I have (it is working and extracting text

Asked: May 13, 2026 (2026-05-13T08:23:16+00:00)

Here is the code I have so far (it works and extracts text as it should):

import pyPdf

def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent("/home/nick/TAM_work/TAM_pdfs/2006-1.pdf").encode("ascii", "ignore")

I now need to add a for loop to run it on all PDFs in /TAM_pdfs, save the text as a CSV, and (if possible) add something to count the pictures. Any help would be greatly appreciated. Thanks for looking.

Matt

1 Answer

Editorial Team · Answer 1 · 2026-05-13T08:23:16+00:00

For loop to run it on all PDFs in a directory: look at the glob module

Save the text as a CSV: look at the csv module

Count the pictures: look at the pyPdf module 🙂
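
As a rough sketch of how these three hints might fit together (written for Python 3, whereas the question's code is Python 2): pdfs_to_csv globs a directory for *.pdf files and writes one CSV row per file, and count_images counts image entries in a page's /XObject resource dictionary. The function names, the column names, and the extract_text callback are all hypothetical. extract_text stands in for something like the question's getPDFContent, and how the /XObject dictionary is obtained from pyPdf is left open here; it is only assumed to behave like a mapping.

```python
import csv
import glob
import os


def count_images(xobjects):
    """Count entries in a page's /XObject mapping whose /Subtype is /Image.

    Assumption: `xobjects` behaves like a dict of dicts. Images nested
    inside form XObjects are not followed in this sketch.
    """
    return sum(1 for name in xobjects
               if xobjects[name].get("/Subtype") == "/Image")


def pdfs_to_csv(pdf_dir, csv_path, extract_text):
    """Write one (filename, text) row per *.pdf file found in pdf_dir.

    `extract_text` is a caller-supplied function taking a file path and
    returning that file's text. Returns the number of PDFs processed.
    """
    paths = sorted(glob.glob(os.path.join(pdf_dir, "*.pdf")))
    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename", "text"])
        for path in paths:
            writer.writerow([os.path.basename(path), extract_text(path)])
    return len(paths)
```

Passing the extractor in as an argument keeps the loop independent of any particular PDF library, so the same function works whether the text comes from pyPdf or something else.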

Two comments on this statement:

content = " ".join(content.replace(u"\xa0", " ").strip().split())

(1) It is not necessary to replace the NBSP (U+00A0) with a SPACE, because NBSP is (naturally) considered to be whitespace by unicode.split()

(2) Using strip() is redundant:

>>> u"  foo  bar  ".split()
[u'foo', u'bar']
>>>
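
Both points can be checked together. A minimal sketch for Python 3, where str is Unicode (the helper name collapse_whitespace is made up for illustration):

```python
def collapse_whitespace(text):
    """Collapse every run of whitespace, NBSP (U+00A0) included, into one space."""
    # split() with no arguments splits on any Unicode whitespace and
    # discards leading and trailing whitespace, so neither replace()
    # nor strip() is needed beforehand.
    return " ".join(text.split())


print(collapse_whitespace("  foo\u00a0\u00a0bar \n baz  "))  # prints: foo bar baz
```

This holds for Unicode strings; a Python 2 byte string is a different case and may not treat the byte 0xA0 as whitespace.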
