I’m using Python with mechanize to scrape a site. If I visit the site with Links, a text-only version of the login page appears, and that is what I’d like my scraper to see. So:
import mechanize
USER_AGENT = "Links (2.3pre1; Linux 2.6.32-5-xen-amd64 x86_64; 80x24)"
mech = mechanize.Browser(factory=mechanize.RobustFactory())
mech.addheaders = [('User-agent', USER_AGENT)]
mech.set_handle_robots(False)
resp = mech.open(URLS['start'])
fnout("001-login.html", resp.read())
resp.close()
fnout just dumps the string to a file. Yet, when I open 001-login.html, the entirety of the page is the word “Robot”. Nothing else.
I haven’t made any other requests. It’s not like I loaded the page and didn’t load the images, or whatever. This was the first request I made, and I set the User-Agent to exactly what the version of Links that worked with the site sends. What am I doing wrong (besides trying to scrape a site that doesn’t want to be scraped, that is)?
Probably there are other headers that Links is sending that Mechanize is not, or vice versa. Hit up http://www.reliply.org/tools/requestheaders.php with both Links and Mechanize and see what headers are being sent.
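Once you have both header lists, comparing them is easy to do in code. Here is a minimal sketch of that comparison. The diff_headers helper and the two header dictionaries are made up for illustration (they are not real captures from Links or Mechanize), so substitute whatever the header echo page actually reports for each client:

```python
def diff_headers(reference, candidate):
    """Compare two header mappings, ignoring case in the header names.

    Returns (missing, different):
      missing   - headers present in `reference` but absent from `candidate`
      different - headers present in both, mapped to (reference, candidate) values
    """
    ref = dict((name.lower(), value) for name, value in reference.items())
    cand = dict((name.lower(), value) for name, value in candidate.items())
    missing = dict((name, value) for name, value in ref.items()
                   if name not in cand)
    different = dict((name, (value, cand[name])) for name, value in ref.items()
                     if name in cand and cand[name] != value)
    return missing, different


# Hypothetical values standing in for what the echo page shows per client.
links_headers = {
    "User-Agent": "Links (2.3pre1; Linux 2.6.32-5-xen-amd64 x86_64; 80x24)",
    "Accept": "*/*",
    "Accept-Language": "en,*;q=0.1",
    "Connection": "keep-alive",
}
mech_headers = {
    "User-agent": "Links (2.3pre1; Linux 2.6.32-5-xen-amd64 x86_64; 80x24)",
    "Accept-Encoding": "identity",
    "Connection": "close",
}

missing, different = diff_headers(links_headers, mech_headers)

# (name, value) tuples, the same shape as the entries in mech.addheaders.
extra_headers = sorted(missing.items())
```

Anything that ends up in extra_headers is a candidate to append to mech.addheaders alongside the User-agent entry from the question. Whether that is enough to get past the "Robot" page depends on what the site actually checks, so treat it as a first thing to try rather than a guaranteed fix.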