I need to automate a process involving a website that is using a login form. I need to capture some data in the pages following the login page.
I know how to screen-scrape normal pages, but not those behind a secure site.
- Can this be done with the .NET WebClient class?
- How would I log in automatically?
- How would I stay logged in for the other pages?
One way would be to automate a browser. You mentioned WebClient, though, so I'm guessing you mean the WebClient class in .NET.
Two main points:
Here are the steps I'd follow:
In step 2, I describe a somewhat complicated method for automating the login. Usually, you can POST the username and password directly to the known login form action without fetching the initial form or relaying its hidden fields. Some sites, however, validate the form as a whole (as opposed to validating individual fields), and on those sites this shortcut does not work.
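As a rough sketch of the direct-post shortcut in C#, assuming the site tracks the login with cookies: `WebClient` does not carry cookies from one request to the next by itself, so the subclass below attaches one shared `CookieContainer` to every request. The class name `CookieAwareWebClient`, the URLs, and the field names `username` and `password` are placeholders for illustration, not details of any real site.

```csharp
using System;
using System.Collections.Specialized;
using System.Net;
using System.Text;

// WebClient subclass that reuses one CookieContainer for every request,
// so a session cookie set at login is sent again with later requests.
public class CookieAwareWebClient : WebClient
{
    public CookieContainer Cookies { get; } = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        if (request is HttpWebRequest httpRequest)
        {
            httpRequest.CookieContainer = Cookies;
        }
        return request;
    }
}

public static class LoginExample
{
    public static void Main()
    {
        using (var client = new CookieAwareWebClient())
        {
            // Hypothetical field names: use the input names from the real form.
            var form = new NameValueCollection
            {
                { "username", "myUser" },
                { "password", "myPassword" }
            };

            // Hypothetical URL: the value of the login form's action attribute.
            byte[] response = client.UploadValues("https://example.com/login", "POST", form);
            Console.WriteLine(Encoding.UTF8.GetString(response).Length);

            // Same client, same cookie container: if the login succeeded,
            // this request carries the session cookie.
            string page = client.DownloadString("https://example.com/members/data");
            Console.WriteLine(page.Length);
        }
    }
}
```

If the site rejects this, fall back to the longer route: GET the form first, copy its hidden fields into the `NameValueCollection`, and then POST.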
HtmlAgilityPack is a .NET library that parses ill-formed HTML into a document tree you can query with XPath, much as you would an XmlDocument. Quite useful.
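For instance, the hidden fields of a login form could be collected roughly like this. It is a minimal sketch that assumes the HtmlAgilityPack NuGet package is referenced; the helper name `ExtractHiddenFields` and the sample markup are made up for illustration.

```csharp
using System;
using System.Collections.Specialized;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class HiddenFieldExample
{
    // Collects the name/value pairs of all hidden inputs in the given HTML.
    public static NameValueCollection ExtractHiddenFields(string html)
    {
        var fields = new NameValueCollection();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//input[@type='hidden']");
        if (nodes == null)
        {
            return fields;
        }

        foreach (HtmlNode node in nodes)
        {
            string name = node.GetAttributeValue("name", "");
            if (name.Length > 0)
            {
                fields[name] = node.GetAttributeValue("value", "");
            }
        }
        return fields;
    }

    public static void Main()
    {
        // Made-up, deliberately sloppy markup with unclosed input tags.
        string html = "<form action='/login'><input type='hidden' name='token' value='abc123'>"
                    + "<input name='username'></form>";
        NameValueCollection fields = ExtractHiddenFields(html);
        Console.WriteLine(fields["token"]);
    }
}
```

The returned collection can have the username and password added to it and then be passed straight to `WebClient.UploadValues`.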
Finally, you may run into a situation where the form relies on client script to alter the form values before submitting. You may need to simulate this behavior.
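What that simulation looks like depends entirely on the site's script. As one purely hypothetical case, suppose the page's JavaScript replaces the password with its hex-encoded MD5 hash in a hidden field named `passwordHash` and blanks the `password` field before submitting. The same transformation could then be reproduced on the .NET side; every field name here is an assumption.

```csharp
using System;
using System.Collections.Specialized;
using System.Security.Cryptography;
using System.Text;

public static class ClientScriptExample
{
    // Lowercase hex MD5 of the UTF-8 bytes of the input, mirroring what
    // the hypothetical page script is assumed to compute.
    public static string Md5Hex(string input)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(input));
            var sb = new StringBuilder(hash.Length * 2);
            foreach (byte b in hash)
            {
                sb.Append(b.ToString("x2"));
            }
            return sb.ToString();
        }
    }

    // Builds the form values as the browser would send them after the script ran.
    public static NameValueCollection BuildForm(string user, string password)
    {
        return new NameValueCollection
        {
            { "username", user },
            { "password", "" },                      // blanked by the assumed script
            { "passwordHash", Md5Hex(password) }     // filled in by the assumed script
        };
    }

    public static void Main()
    {
        NameValueCollection form = BuildForm("myUser", "myPassword");
        Console.WriteLine(form["passwordHash"]);
    }
}
```

Comparing the values posted by a real browser with the values your code builds is the quickest way to confirm that the simulation matches.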
A tool for viewing the HTTP traffic is extremely helpful for this type of work. I recommend ieHttpHeaders, Fiddler, or Firebug (Net tab).