I am trying to figure out how to log in to a secure website in order to parse user-specific data, and I can’t find a specific example of how to do so. I would like to write it in PHP, but many of my searches haven’t turned up anything for that language. I’m familiar with Python and feel that it might be of more use in this scenario. It also seems that many sites have their own APIs for logging in, but finding and using a site-specific API seems like more work than something I could write once and then adapt.
For example: how could I log in to Stack Overflow programmatically and then parse my profile to fetch the total number of consecutive days I’ve logged in?
Using Simple_HTML_DOM, I have written this, which I’ve used before to parse non-secured HTML:
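<?php
include_once('simple_html_dom.php');
// Fetch the profile page directly; no login is performed here.
$html = file_get_html("http://stackoverflow.com/users/779920/nick");
// Print every element whose class attribute is "days-visited".
foreach ($html->find('[class=days-visited]') as $e) {
    echo $e->outertext . '<br>';
}
?>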
But it doesn’t work in this case. I’m not sure if this is on the right track, but I have tried to familiarize myself with POST data using Firebug for Chrome. The tool is rather complex to me right now, and I’m not exactly sure how to properly decipher the data I’m given.
Any help would be appreciated.
I think that it depends on exactly what system the page is using for authentication, but here is a snippet I used recently for exactly the same thing. In my case, I simply wanted to download the page:
I refer you to the urllib documentation (for Python 3; in Python 2 it is urllib2). It is reasonably well documented, although it took me a bit of trial and error to figure out the exact steps I needed. Note that the authentication handler only needs to handle the root you log into (in this case http://secure.website.com). Once you’ve installed the handler, it will recognise any pages belonging to that domain and use the authentication information given. Also remember that this is not all that secure: anyone with access to the code will be able to see your login details.
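The handler setup described above can be sketched roughly as follows. This is only a minimal illustration and assumes the site uses HTTP Basic authentication, which may well not be true of a given site (a form-based login would need a different approach). The function name build_authenticated_opener, the credentials, and the page path in the comment are hypothetical placeholders.

```python
import urllib.request


def build_authenticated_opener(root_url, username, password):
    """Return an opener that sends Basic auth credentials for root_url.

    Assumes the server uses HTTP Basic authentication.
    """
    # The default-realm manager lets us register credentials without
    # knowing the realm name the server will announce.
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    password_mgr.add_password(None, root_url, username, password)

    handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
    return urllib.request.build_opener(handler)


# Placeholder credentials: anyone who can read this file can read them too.
opener = build_authenticated_opener(
    "http://secure.website.com", "my_username", "my_password"
)

# Installing the opener makes urllib.request.urlopen() use it globally.
urllib.request.install_opener(opener)

# Any page under the registered root would then be fetched with the
# stored credentials, for example (hypothetical path, not executed here):
# page = urllib.request.urlopen("http://secure.website.com/some/page.html").read()
```

Reading the username and password from environment variables or a prompt instead of hard-coding them would avoid leaving the login details in the source.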
If you subsequently want to parse the webpage, you can use html.parser (or the Python 2 version, HTMLParser), or the much more powerful BeautifulSoup.
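As a rough sketch of the html.parser route, the class below collects the text inside every element carrying a given class name. The class name days-visited comes from the question; the parser class ClassTextExtractor and the sample markup are made up for illustration, and the real page structure may differ.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextExtractor(HTMLParser):
    """Collect the text content of elements that have a given CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # nesting depth inside a matching element
        self.matches = []   # one text string per matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            # Already inside a match: just track nesting.
            self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.depth = 1
            self.matches.append("")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.matches[-1] += data


# Hypothetical markup standing in for a downloaded profile page.
sample_html = (
    "<p>other</p>"
    '<div class="days-visited"><span>visited 12 days</span>, 5 consecutive</div>'
)

parser = ClassTextExtractor("days-visited")
parser.feed(sample_html)
print(parser.matches)
```

With BeautifulSoup the same lookup is much shorter, which is the main reason to prefer it for anything beyond simple pages.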