OK, so I'm writing a program that needs to log in to a website and then scrape some information out of it.
Here is my code for logging on:
module Webscraper =
    open System.Net
    open HtmlAgilityPack
    open Lolcr.Model
    open System.Collections.Specialized

    let logon = fun (address:string) studentNumber password ->
        let upload values =
            let wc = new WebClient()
            wc.UploadValues (address, values)
        let ToNameValueCollection nvs =
            let col = new NameValueCollection()
            for nv in nvs do
                match nv with (n, v) -> col.Add(n, v);
            col
        let fields :List<string*string> =
            ("v_studentid",studentNumber) ::
            ("v_studentpin", password) ::
            ("b3", "Login") :: []
        let resp = fields |> ToNameValueCollection |> upload;
        resp |> Array.map char |> System.String.Concat

    //and for viewing a page within the site:
    let pageAt = fun (address : string) ->
        let getWebStream =
            let req = HttpWebRequest.Create address
            let resp = req.GetResponse()
            resp.GetResponseStream
        let doc = new HtmlDocument()
        getWebStream() |> doc.Load;
        doc.DocumentNode
Now when I call logon, it returns the text of the logon page as if I hadn't logged on (possibly because logging on would have done a redirect in the browser).
When I call pageAt on the page I'm interested in, it returns the "Please log in" page.
Looking at what is happening in Fiddler2 (where XXXX and YYYY are the studentNumber and password respectively):
//Via firefox
POST https://server2.olcr.uwa.edu.au/olcrstudent/index.jsp HTTP/1.1
Host: server2.olcr.uwa.edu.au
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Referer: https://server2.olcr.uwa.edu.au/olcrstudent/
Cookie: JSESSIONID=18F87DFEB1555A6FA644215FDAE5E506; __utma=55889711.14817822.1328281214.1328281214.1328281214.1; __utmz=55889711.1328281214.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=olcr%20uwa; __utmb=55889711.1.10.1328281214; __utmc=55889711
Content-Type: application/x-www-form-urlencoded
Content-Length: 53
v_studentid=XXXX&v_studentpin=YYYY&b3=Login
//From my program:
POST https://server2.olcr.uwa.edu.au/olcrstudent/index.jsp HTTP/1.1
Content-Type: application/x-www-form-urlencoded
Host: server2.olcr.uwa.edu.au
Content-Length: 53
Expect: 100-continue
Connection: Keep-Alive
v_studentid=XXXX&v_studentpin=YYYY&b3=Login
So the big difference, as far as I can see, is that I'm not sending any cookies. (I'm actually not entirely sure what cookies are, come to think of it. I'll look that up. EDIT: Done.)
So should I be sending cookies?
What are the mechanisms for this in .NET?
Should I be doing something different because this is HTTPS?
Yes, normally you will need to persist cookies to log in to sites.
A CookieAwareWebClient, such as the one from this blog, makes it simple.
The F# equivalent is:
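A minimal sketch of what such a cookie-aware client can look like in F#, assuming the usual pattern of subclassing WebClient and overriding GetWebRequest to attach one shared CookieContainer. The field name cookies and the exact shape of the type are assumptions for illustration, not necessarily what the blog post uses:

    open System
    open System.Net

    /// A WebClient that remembers cookies between requests (sketch).
    type CookieAwareWebClient() =
        inherit WebClient()

        // One container shared by every request made through this client,
        // so a session cookie set by one response is sent with the next request.
        let cookies = CookieContainer()

        override this.GetWebRequest(address: Uri) =
            let request = base.GetWebRequest(address)
            // Only HTTP(S) requests carry cookies; leave other request types alone.
            match request with
            | :? HttpWebRequest as http -> http.CookieContainer <- cookies
            | _ -> ()
            request

Nothing extra should be needed for HTTPS here: the same CookieContainer is used whichever scheme the request goes over.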
Now, so long as you make all your web requests through the same WebClient, you will be fine. That means making the WebClient accessible throughout the module and changing pageAt to use it as well.
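As a rough sketch of that change, assuming a CookieAwareWebClient type like the one described above is already in scope: the name client is hypothetical, decoding the response as UTF-8 is an assumption about the server, and the form field names are taken from the question.

    open System.Text
    open System.Collections.Specialized
    open HtmlAgilityPack

    // One client for the whole module: the session cookie set by the
    // login response is then sent along with every later request.
    // (CookieAwareWebClient is assumed to be defined elsewhere.)
    let client = new CookieAwareWebClient()

    let logon (address: string) (studentNumber: string) (password: string) =
        let values = NameValueCollection()
        values.Add("v_studentid", studentNumber)
        values.Add("v_studentpin", password)
        values.Add("b3", "Login")
        let bytes = client.UploadValues(address, values)
        // Assumption: the server replies in UTF-8.
        Encoding.UTF8.GetString bytes

    let pageAt (address: string) =
        let doc = HtmlDocument()
        // Same client as logon, so the request carries the session cookie.
        client.DownloadString address |> doc.LoadHtml
        doc.DocumentNode

Call logon once first, then pageAt for each page you want; if the browser trace shows the site setting JSESSIONID on the initial GET of the login page, fetching that page through client before logging on may also be necessary.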