So far, here is the code I have (it is working and extracting text as it should):
import pyPdf

def getPDFContent(path):
    content = ""
    # Load PDF into pyPdf
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent("/home/nick/TAM_work/TAM_pdfs/2006-1.pdf").encode("ascii", "ignore")
I now need to add a for loop to get it to run on all PDFs in /TAM_pdfs, save the text as a CSV and, if possible, add something to count the pictures. Any help would be greatly appreciated. Thanks for looking.
Matt
for loop to get it to run on all PDFs in a directory: look at the glob module
save the text as a CSV: look at the csv module
count the pictures: look at the pyPDF module 🙂
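A minimal sketch of how those three hints could fit together. The helper names (count_images, pdfs_to_csv) are hypothetical, and the image counter assumes a pyPdf page exposes its resource dictionary via dict-style access; real pyPdf objects may need a .getObject() call before the /Subtype lookup:

    import csv
    import glob
    import os

    def count_images(page):
        # Images in a PDF page live in its /Resources -> /XObject
        # dictionary as entries with /Subtype == /Image.
        # Assumption: dict-style access works on the page object.
        try:
            xobjects = page["/Resources"]["/XObject"]
        except KeyError:
            return 0
        return sum(1 for name in xobjects
                   if xobjects[name]["/Subtype"] == "/Image")

    def pdfs_to_csv(pdf_dir, csv_path, extract):
        # Run `extract` (e.g. the getPDFContent above) over every PDF
        # in pdf_dir and write one (filename, text) row per file.
        # File opened in text mode (Python 3 style; under Python 2
        # open the CSV with mode "wb" instead).
        with open(csv_path, "w") as out:
            writer = csv.writer(out)
            writer.writerow(["filename", "text"])
            for path in sorted(glob.glob(os.path.join(pdf_dir, "*.pdf"))):
                writer.writerow([os.path.basename(path), extract(path)])

glob.glob does the directory walk, csv.writer handles the quoting of extracted text, and count_images would be called once per page inside the existing page loop.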
Two comments on this statement:
(1) It is not necessary to replace the NBSP (U+00A0) with a SPACE, because NBSP is (naturally) considered to be whitespace by unicode.split().
(2) Using strip() is redundant: split() with no arguments already ignores leading and trailing whitespace, so the join can never produce stray spaces at either end.
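A quick demonstration of both points, with no PDF involved (the sample string is made up for illustration):

    # split() with no arguments splits on any Unicode whitespace,
    # including NBSP (U+00A0), and discards leading/trailing runs,
    # so neither the replace() nor the strip() is needed.
    raw = u"\xa0 Total\xa0Airport \n Management \t report \xa0"
    collapsed = " ".join(raw.split())
    # -> "Total Airport Management report"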