I am working on a C# project where I need to get data from a secured website that does not have an API or web services. My plan is to log in, get to the page I need, and parse the HTML to extract the data I need to log to a database. Right now I’m testing with a console app, but eventually this will be converted to an Azure Service Bus application.
To get to anything, you have to log in at their login.cfm page, which means I need to fill in the username and password input controls on the page and click the submit button, then navigate to the page I need to parse.
Since I don’t have a ‘browser’ to parse for controls, I am trying to use various C# .NET classes to get to the page, set the username and password, and click submit, but nothing seems to work.
Any examples I can look at, or .NET classes I should be reviewing that were designed for this sort of project?
Thanks!
Use the WebClient class in System.Net.
The stock WebClient does not keep cookies between requests, so to persist the session cookie you’ll have to make a custom WebClient subclass (called WebClientX here).
Use a browser add-on like Firebug or the development tools built into Chrome to see the HTTP POST data that is sent when you submit a form. Send those same POSTs using the WebClientX class and parse the response HTML.
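The WebClientX class itself is not shown in this answer, so here is a minimal sketch of what such a class might look like, plus a login POST. The form field names ("username", "password"), the LoginExample helper, and the URLs passed in are assumptions for illustration; use whatever names and URLs the recorded POST from your browser's development tools actually shows.

```csharp
using System;
using System.Collections.Specialized;
using System.Net;

// Hypothetical stand-in for "WebClientX": a WebClient that reuses one
// CookieContainer so the session cookie set at login is sent on later requests.
public class WebClientX : WebClient
{
    public CookieContainer Cookies { get; } = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        if (request is HttpWebRequest http)
        {
            // Attach the shared container to every HTTP request.
            http.CookieContainer = Cookies;
        }
        return request;
    }
}

public static class LoginExample
{
    // Posts the login form, then downloads a page that requires the session.
    public static string FetchProtectedPage(
        string loginUrl, string pageUrl, string user, string password)
    {
        using (var client = new WebClientX())
        {
            var form = new NameValueCollection
            {
                { "username", user },     // assumed field name
                { "password", password }  // assumed field name
            };

            // Sends application/x-www-form-urlencoded data, like the browser form.
            client.UploadValues(loginUrl, "POST", form);

            // The same client (and cookies) is used for the follow-up request.
            return client.DownloadString(pageUrl);
        }
    }
}
```

If the login form also posts hidden fields, add them to the collection exactly as they appear in the recorded request. On newer .NET versions WebClient is marked obsolete in favor of HttpClient; the same cookie idea applies there through HttpClientHandler.CookieContainer.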
The fastest way to parse HTML when you already know the format is a simple Regex.Match. So you’d go through the actions in your browser with the development tools open to record your POSTs, URLs and HTML content, and then perform the same tasks using WebClientX.
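As a rough illustration of the Regex.Match approach, the sketch below pulls the text out of a table cell with a known class attribute. The HtmlExtract helper, the "balance" class name, and the sample markup are made up for the example; the real pattern depends on the HTML of the page you are scraping.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlExtract
{
    // Returns the trimmed inner text of the first <td> whose class attribute
    // equals cssClass, or null when no such cell is found.
    public static string ExtractCell(string html, string cssClass)
    {
        string pattern =
            "<td[^>]*class=\"" + Regex.Escape(cssClass) + "\"[^>]*>(.*?)</td>";

        // Singleline lets ".*?" span line breaks inside the cell.
        Match m = Regex.Match(
            html, pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);

        return m.Success ? m.Groups[1].Value.Trim() : null;
    }
}
```

For example, HtmlExtract.ExtractCell("<tr><td class=\"balance\"> 42.50 </td></tr>", "balance") returns "42.50". This only holds up while the page layout stays the same, which is the trade-off of matching on a known format.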