I’m trying to extract each page of a PDF as a string:
import pyPdf
pages = []
pdf = pyPdf.PdfFileReader(file('g-reg-101.pdf', 'rb'))
for i in range(0, pdf.getNumPages()):
    this_page = pdf.getPage(i).extractText() + "\n"
    this_page = " ".join(this_page.replace(u"\xa0", " ").strip().split())
    pages.append(this_page.encode("ascii", "xmlcharrefreplace"))
for page in pages:
    print '*' * 80
    print page
But this script ignores newline characters, leaving me with messy strings like "information concerning an individual which, because of name, identifyingnumber, mark or description" (i.e., this should read "identifying number", not "identifyingnumber").
Here’s an example of the type of PDF I’m trying to parse.
I don’t know much about PDF encoding, but I think you can solve your particular problem by modifying pdf.py. In the PageObject.extractText method, you can see what’s going on: if the operator is Tj or TJ (it’s Tj in your example PDF), then the text is simply appended and no newline is added.
Now, you wouldn’t necessarily want to add a newline, at least if I’m reading the PDF reference right: Tj/TJ are simply the single and multiple show-string operators, and the existence of a separator of some kind isn’t mandatory.
Anyway, if you modify this code to be something like
[…]
[…]
then the default behaviour should be the same:
but you can change it when you want to:
or
Alternatively, you could simply add the separators yourself by modifying the operands themselves in-place, but that could break something else (methods like get_original_bytes make me nervous).
Finally, you don’t have to edit pdf.py itself if you don’t want to: you could simply pull out this method into a function.
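To make the "pull it out into a function" idea concrete, here is a minimal sketch rather than pyPdf's actual code. It assumes the content stream has already been parsed into (operands, operator) pairs, which is the shape I believe pyPdf's ContentStream.operations has. The function name and the Tj_sep/TJ_sep parameters are hypothetical, and plain str stands in for pyPdf's text string type so the sketch runs without the library.

```python
def extract_text_with_separators(operations, Tj_sep="", TJ_sep=""):
    """Join the text shown by a list of (operands, operator) pairs.

    Hypothetical helper: Tj_sep and TJ_sep are appended after each
    Tj / TJ operation. The empty defaults add nothing, so text runs
    are concatenated directly, as described in the answer above.
    """
    parts = []
    for operands, operator in operations:
        if operator == "Tj":
            # Single show-string: operands[0] is the string to show.
            if isinstance(operands[0], str):
                parts.append(operands[0])
                parts.append(Tj_sep)
        elif operator == "TJ":
            # Multiple show-string: operands[0] is an array mixing strings
            # with numeric position adjustments; keep only the strings.
            for item in operands[0]:
                if isinstance(item, str):
                    parts.append(item)
            parts.append(TJ_sep)
        elif operator == "T*":
            # Move to the start of the next line.
            parts.append("\n")
        elif operator == "'":
            # Move to the next line, then show operands[0].
            parts.append("\n")
            if isinstance(operands[0], str):
                parts.append(operands[0])
        elif operator == '"':
            # Set spacing (operands[0], operands[1]), move to the next
            # line, then show operands[2].
            parts.append("\n")
            if isinstance(operands[2], str):
                parts.append(operands[2])
    return "".join(parts)


# Two consecutive Tj operations, as in the "identifyingnumber" example.
ops = [(["identifying"], "Tj"), (["number"], "Tj")]
print(extract_text_with_separators(ops))              # identifyingnumber
print(extract_text_with_separators(ops, Tj_sep=" "))  # words separated by a space
```

With real pyPdf you would pass the page's parsed content stream operations instead of the hand-built list, and check for the library's text string type rather than str. Whether a space or a newline is the right separator depends on how the particular PDF was generated.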