I was using HtmlAgilityPack to get the HTML from the following website: http://tennis.wettpoint.com/en/
It worked fine, but after about an hour it stopped working.
First I tried changing how I retrieve the HTML:
string url = "http://tennis.wettpoint.com/en/";
HtmlWeb hw = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = hw.Load(url);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    //Code..
}
As I said, that always worked fine, until the site seemed to be “down” for me.
So I changed the code to:
using (WebClient wc = new WebClient())
{
    wc.Headers.Add("user-agent", "Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4");
    string html = wc.DownloadString("http://en.wikipedia.org/wiki/United_States");
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
}
(That didn’t work for my site, but it worked for another site.)
Finally, I now have this, which also works, but not for my site:
HtmlAgilityPack.HtmlDocument doc = GetHTMLDocumentByURL(url);

public HtmlAgilityPack.HtmlDocument GetHTMLDocumentByURL(string url)
{
    var htmlDoc = new HtmlAgilityPack.HtmlDocument();
    htmlDoc.OptionReadEncoding = false;
    var request = (HttpWebRequest)WebRequest.Create(url);
    request.UserAgent = @"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5";
    request.Method = "GET";
    using (var response = (HttpWebResponse)request.GetResponse())
    {
        using (var stream = response.GetResponseStream())
        {
            htmlDoc.Load(stream, Encoding.UTF8);
        }
    }
    return htmlDoc;
}
At first I believed the site was down, because I couldn’t access it with any browser either. So I asked some friends, and they were able to access the site. That means my IP address has been blocked, for whatever reason. What can I do? Do I need to change my IP address (and how?) or use proxies (and how?). I have no clue, as I didn’t expect this to happen 🙁 I hope someone can help me.
Wikipedia monitors the number of requests it gets from an IP address and will ban IPs that aggressively scrape its content. Scraping Google search results will have the same effect.
Initially Wikipedia will only ban you for 24 hours, but if you carry on “offending”, your IP will be banned permanently.
You can either route your HTTP requests through proxies to change your IP address, or slow down your requests.
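Both options can be sketched roughly as follows. This is only a minimal illustration, not a guaranteed way around a block: the proxy address, the five-second delay, the user-agent string, and the names PoliteDownloader, Download, ProxyAddress and DelayBetweenRequests are all assumptions made up for the example. It sticks to the HttpWebRequest API used in the question and sets a proxy through the standard WebProxy class.

```csharp
using System;
using System.IO;
using System.Net;
using System.Threading;

class PoliteDownloader
{
    // Hypothetical placeholder: replace with a proxy you are allowed to use.
    const string ProxyAddress = "http://127.0.0.1:8080";

    // Assumed delay; the right value depends on what the site tolerates.
    static readonly TimeSpan DelayBetweenRequests = TimeSpan.FromSeconds(5);

    static string Download(string url, bool useProxy)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.UserAgent = "MyScraper/1.0"; // hypothetical user-agent string

        if (useProxy)
        {
            // Send the request through the proxy instead of connecting directly.
            request.Proxy = new WebProxy(ProxyAddress);
        }

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        string[] urls = { "http://tennis.wettpoint.com/en/" };

        foreach (string url in urls)
        {
            string html = Download(url, useProxy: false);
            Console.WriteLine("{0}: {1} characters", url, html.Length);

            // Throttle: pause before the next request so the site is not hammered.
            Thread.Sleep(DelayBetweenRequests);
        }
    }
}
```

The returned string can then be passed to HtmlDocument.LoadHtml as in the question. Slowing down is usually the first thing to try, since a proxy only changes the address the requests come from and will be blocked as well if the request rate stays the same.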