I’m trying to scrape some data off of the FEC.gov website using python for a project of mine. Normally I use python mechanize and beautifulsoup to do the scraping.
I’ve been able to figure out most of the issues but can’t seem to get around a problem. It seems like the data is streamed into the table and mechanize.Browser() just stops listening.
So here’s the issue:
If you visit http://query.nictusa.com/cgi-bin/can_ind/2011_P80003338/1/A … you get the first 500 contributors whose last name starts with A and have given money to candidate P80003338 … however, if you use browser.open() at that url all you get is the first ~5 rows.
I’m guessing it’s because mechanize isn’t letting the page fully load before the .read() is executed. I tried putting a time.sleep(10) between the .open() and .read() but that didn’t make much difference.
And I checked, there’s no JavaScript or AJAX in the website (or at least none is visible when you use ‘view-source’). So I don’t think it’s a JavaScript issue.
Any thoughts or suggestions? I could use selenium or something similar but that’s something that I’m trying to avoid.
-Will
Why not use an HTML parser like lxml with XPath expressions?
I tried
Similarly, you can create an XPath expression of your choice to work with.
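For illustration only, here is a minimal sketch of what pulling table rows out with lxml and XPath can look like. It is not the snippet from this reply. The sample markup, the placeholder names in it, and the helper name table_rows are all assumptions, and the real page's table may be laid out differently, so the expression would likely need adjusting.

```python
from lxml import html

# Hypothetical stand-in for the contributor table; the real page's
# markup and values are not known here and may differ.
SAMPLE = """
<html><body><table>
<tr><th>Contributor</th><th>Amount</th></tr>
<tr><td>Contributor A</td><td>250</td></tr>
<tr><td>Contributor B</td><td>500</td></tr>
</table></body></html>
""".strip()


def table_rows(page_source):
    """Return the text of each data row (rows that contain <td> cells)."""
    tree = html.fromstring(page_source)
    rows = []
    # "//table//tr[td]" selects rows with at least one <td>, skipping header rows.
    for tr in tree.xpath("//table//tr[td]"):
        rows.append([td.text_content().strip() for td in tr.xpath("./td")])
    return rows


rows = table_rows(SAMPLE)
for row in rows:
    print(row)
```

For the live page you would pass in the HTML you downloaded instead of SAMPLE. Whether a different fetch method returns more than the first few rows is something to check by comparing the length of the response with what the browser shows.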