I have the following sample code where I download a pdf from the European

Question

0

Asked: May 16, 20262026-05-16T15:56:00+00:00 2026-05-16T15:56:00+00:00

I have the following sample code where I download a pdf from the European

0

I have the following sample code where I download a pdf from the European Parliament website on a given legislative proposal:

EDIT: I ended up just getting the link and feeding it to adobes online conversion tool (see the code below):

import mechanize
import urllib2
import re
from BeautifulSoup import *

adobe = "http://www.adobe.com/products/acrobat/access_onlinetools.html"

url = "http://www.europarl.europa.eu/oeil/search_reference_procedure.jsp"

def get_pdf(soup2):
    link = soup2.findAll("a", "com_acronym")
    new_link = []
    amendments = []
    for i in link:
        if "REPORT" in i["href"]:
            new_link.append(i["href"])
    if new_link == None:
        print "No A number"
    else:
        for i in new_link:
            page = br.open(str(i)).read()
            bs = BeautifulSoup(page)
            text = bs.findAll("a")
            for i in text:
                if re.search("PDF", str(i)) != None:
                    pdf_link = "http://www.europarl.europa.eu/" + i["href"]
            pdf = urllib2.urlopen(pdf_link)
            name_pdf = "%s_%s.pdf" % (y,p)
            localfile = open(name_pdf, "w")
            localfile.write(pdf.read())
            localfile.close()

            br.open(adobe)
            br.select_form(name = "convertFrm")
            br.form["srcPdfUrl"] = str(pdf_link)
            br["convertTo"] = ["html"]
            br["visuallyImpaired"] = ["notcompatible"]
            br.form["platform"] =["Macintosh"]
            pdf_html = br.submit()

            soup = BeautifulSoup(pdf_html)


page = range(1,2) #can be set to 400 to get every document for a given year
year = range(1999,2000) #can be set to 2011 to get documents from all years

for y in year:
    for p in page:
        br = mechanize.Browser()
        br.open(url)
        br.select_form(name = "byReferenceForm")
        br.form["year"] = str(y)
        br.form["sequence"] = str(p)
        response = br.submit()
        soup1 = BeautifulSoup(response)
        test = soup1.find(text="No search result")
        if test != None:
            print "%s %s No page skipping..." % (y,p)
        else:
            print "%s %s  Writing dossier..." % (y,p)
            for i in br.links(url_regex="file.jsp"):
                link = i
            response2 = br.follow_link(link).read()
            soup2 = BeautifulSoup(response2)
            get_pdf(soup2)

In the get_pdf() function I would like to convert the pdf file to text in python so I can parse the text for information about the legislative procedure. can anyone explaon me how this can be done?

Thomas

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-16T15:56:00+00:00

It’s not exactly magic. I suggest

downloading the PDF file to a temp directory,
calling out to an external program to extract the text into a (temp) text file,
reading the text file.

For text extraction command-line utilities you have a number of possibilities and there may be others not mentioned in the link (perhaps Java-based). Try them first to see if they fit your needs. That is, try each step separately (finding the links, downloading the files, extracting the text) and then piece them together. For calling out, use subprocess.Popen or subprocess.call().

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I have the following sample code where I download a pdf from the European

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply