I'm trying to scrape content from a site with Python. The site uses simple form authentication with a username and password, but the form also has a hidden field called "foil" that contains what looks like a randomly generated string each time the page is loaded. To log in successfully, that value must be submitted with the POST request. I've tried scraping the random string out of the login page after it loads, but the site still redirects me back to the login page. I have a valid username and password that work for the site. The site is updated sporadically, and I would like to send myself an email when something changes. Here is the code I've been working with so far:
import urllib, urllib2, cookielib,subprocess
url='https://example.com/login.asp'
username='blah'
password='blah'
request = urllib2.Request(url)
opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1))
preData = opener.open(request).readlines()
for line in preData:
    if "foil" in line:
        foils = line.split('"')
        notFoiled = foils[3]
query_args={'location':'','qstring':'','absr_ID':notFoiled,'id':username,'pin':password,'submit':'Sign In'}
requestWheader = urllib2.Request('https://example.com/login.asp')
requestWheader.add_data(urllib.urlencode(query_args))
print 'Request method after data :', requestWheader.get_method()
print
print 'OUTGOING DATA:'
print requestWheader.get_data()
print
print 'SERVER RESPONSE:'
print urllib2.urlopen(requestWheader).read()
rawRes = urllib2.urlopen(requestWheader).read()
The form looks like this…
<form name="loginform" method="post" action="https://example.com/login.asp?x=x&&pswd=">
<input type=hidden name="location" value="">
<input type=hidden name="qstring" value="">
<input type=hidden name="absr_ID" value="">
<input type=hidden name="foil" value="91fcMO">
<input type="text" name="id" maxlength="80" size="21" value="" mask="" desc="ID" required="true">
<input type="submit" name="submit" value="Sign In" onClick="return checkForm(loginform)">
<input type="password" name="pin" size="6" maxlength="6" desc="Pin" required="true">
You import cookielib, but it does not seem like you're using any CookieJars. Build your opener with an HTTPCookieProcessor wrapped around a CookieJar, then use that same opener for both the initial form fetch and the login form submission. I assume it's a cookie-based protection where the value that comes from the foil field has to match a cookie that comes in the headers.
Another thing I noticed in your code is that you assign notFoiled to absr_ID instead of foil. Was that intentional?
Also, please do yourself a favor and use html5lib or BeautifulSoup instead of parsing HTML manually.
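To make the suggestions above concrete, here is a minimal sketch of how they might fit together. It is written for Python 3, where urllib2 and cookielib are named urllib.request and http.cookiejar, and it uses the standard library's html.parser as a stand-in for html5lib or BeautifulSoup so that it has no third-party dependencies. The helper names (HiddenFieldParser, extract_hidden_fields, build_opener, login) are made up for illustration. It also assumes that the cookie guess above is right and that posting to the same login.asp URL is acceptable; the form's action attribute carries a query string that the real site may require.

```python
import http.cookiejar
import urllib.parse
import urllib.request
from html.parser import HTMLParser

# Placeholder URL taken from the question; the form's action URL may differ.
LOGIN_URL = 'https://example.com/login.asp'


class HiddenFieldParser(HTMLParser):
    """Collects name/value pairs of every <input type=hidden> element."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attrs = dict(attrs)
        is_hidden = (attrs.get('type') or '').lower() == 'hidden'
        if is_hidden and attrs.get('name'):
            self.fields[attrs['name']] = attrs.get('value') or ''


def extract_hidden_fields(html):
    """Return a dict of the hidden form fields found in the HTML text."""
    parser = HiddenFieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields


def build_opener():
    """Create an opener that stores cookies and sends them back on later requests."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def login(opener, username, password, url=LOGIN_URL):
    """Fetch the form, then post it back through the same opener."""
    # First request: the server can set its cookies and issue a fresh foil value.
    page = opener.open(url).read().decode('utf-8', errors='replace')
    # Hidden fields keep their own names, so foil is sent as 'foil'.
    form = extract_hidden_fields(page)
    form.update({'id': username, 'pin': password, 'submit': 'Sign In'})
    data = urllib.parse.urlencode(form).encode('ascii')
    # Second request: same opener, so the cookies from the first one are included.
    return opener.open(url, data).read()
```

Calling `login(build_opener(), 'blah', 'blah')` would perform both requests with one cookie jar. Whether that is enough to get past the redirect depends on how the site actually validates foil, which is not known from the question.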