Using System.Net.HttpWebRequest, I would like to imitate a user's search on the following search engine in my code.
An example of the search URL is as follows:
http://www.scirus.com/srsapp/search?q=core+facilities&t=all&sort=0&g=s
I have the following code to perform the HTTP GET. Note that I'm using HtmlAgilityPack to parse the response.
    // Requires: using System; using System.Net; using HtmlAgilityPack;
    protected override HtmlDocument MakeRequestHtml(string requestUrl)
    {
        try
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(requestUrl);
            request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)";

            // Dispose the response so the underlying connection is released.
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                HtmlDocument htmlDoc = new HtmlDocument();
                htmlDoc.Load(response.GetResponseStream());
                return htmlDoc;
            }
        }
        catch (Exception e)
        {
            // Report the failure and wait for a key press before returning.
            Console.WriteLine(e.Message);
            Console.Read();
            return null;
        }
    }
Here, “requestUrl” is the example search URL shown above.
The contents of htmlDoc.DocumentNode.InnerHtml contain no search results and look nothing like the results page you get when you paste the example search URL into a browser.
I’m guessing this happens because a session must be established before the site will answer search requests. Can anybody advise whether there is a feasible way to replicate the behavior of a browser? Or is there a better way of “scraping” the search results that I don’t know about? Suggestions, please.
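To make the session guess concrete, here is a minimal sketch of how cookies could be carried across requests with a shared CookieContainer. The class and method names (SessionSketch, CreateRequest, Fetch) are hypothetical, and it is only an assumption that the site relies on cookies; it may depend on other headers or on JavaScript instead.

```csharp
using System;
using System.IO;
using System.Net;

static class SessionSketch
{
    // Hypothetical helper: builds a GET request that sends and stores
    // cookies through the shared jar passed in by the caller.
    public static HttpWebRequest CreateRequest(string url, CookieContainer cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        // When CookieContainer is left null, response cookies are not kept.
        request.CookieContainer = cookies;
        request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)";
        return request;
    }

    // Hypothetical helper: performs the request and returns the body as text.
    public static string Fetch(string url, CookieContainer cookies)
    {
        using (var response = (HttpWebResponse)CreateRequest(url, cookies).GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

The idea would be to create one CookieContainer, call Fetch on the site's home page first so any session cookie lands in the jar, then call Fetch on the search URL with the same jar and pass the returned string to HtmlDocument.LoadHtml. Whether the home page actually issues such a cookie is untested here.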
Robots.txt contents:
# / robots.txt file for http://www.scirus.com
User-agent: NetMechanic
Disallow: /srsapp/sciruslink
User-agent: *
Disallow: /srsapp/sciruslink
Disallow: /srsapp/search
Disallow: /srsapp/search_simple
Disallow: /search_simple
# for dev and accept server uncomment below line at Build time to disallow robots completely
##Disallow: /
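For reference, the rules above can be checked in code. The following is a rough sketch, not a full robots.txt implementation: it only handles User-agent and Disallow lines with simple prefix matching, ignores Allow rules and group precedence, and the names RobotsSketch and IsDisallowed are hypothetical.

```csharp
using System;

static class RobotsSketch
{
    // Returns true if `path` starts with a Disallow prefix in any group that
    // applies to `userAgent` (group value "*" or a substring of the agent name).
    public static bool IsDisallowed(string robotsTxt, string userAgent, string path)
    {
        bool groupApplies = false;
        bool lastWasAgent = false;
        foreach (var rawLine in robotsTxt.Split('\n'))
        {
            // Strip comments, then surrounding whitespace (including '\r').
            string line = rawLine;
            int hash = line.IndexOf('#');
            if (hash >= 0) line = line.Substring(0, hash);
            line = line.Trim();

            int colon = line.IndexOf(':');
            if (colon < 0) continue;
            string field = line.Substring(0, colon).Trim().ToLowerInvariant();
            string value = line.Substring(colon + 1).Trim();

            if (field == "user-agent")
            {
                bool matches = value == "*" ||
                    (value.Length > 0 &&
                     userAgent.IndexOf(value, StringComparison.OrdinalIgnoreCase) >= 0);
                // Consecutive User-agent lines belong to the same group.
                groupApplies = lastWasAgent ? (groupApplies || matches) : matches;
                lastWasAgent = true;
            }
            else
            {
                lastWasAgent = false;
                if (field == "disallow" && groupApplies && value.Length > 0 &&
                    path.StartsWith(value, StringComparison.Ordinal))
                {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Fed the file above, this check reports /srsapp/search as disallowed for any user agent, which is the path the example search URL uses.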
Content of htmlDoc.DocumentNode.InnerHtml

OK, I actually tested with WebClient as well.
Here is the downloaded file: http://pastebin.com/qswtgC4n