I am writing my own web crawler as a C# 4.0 WPF application. Currently I am using HtmlAgilityPack to process HTML documents.
This is how I am downloading the pages at the moment:
    // Configure a single HtmlWeb instance. Creating a second one afterwards
    // would throw away the UserAgent and PreRequest settings.
    HtmlWeb hwWeb = new HtmlWeb
    {
        UserAgent = lstAgents[GenerateRandomValue.GenerateRandomValueMin(irAgentsCount, 0)],
        PreRequest = OnPreRequest,
        AutoDetectEncoding = false,
        OverrideEncoding = Encoding.GetEncoding("iso-8859-9")
    };
    HtmlDocument hdMyDoc = hwWeb.Load(srPageUrl);

    // Called by HtmlWeb before each request is sent; returning true lets it proceed.
    private static bool OnPreRequest(HttpWebRequest request)
    {
        request.AllowAutoRedirect = true;
        return true;
    }
My question: I want to be able to determine whether a given URL returns text/html (crawlable content) or some other type, such as an image or a PDF. How can I do that?
Thank you very much for the answers.
C# 4.0, WPF application
Rather than relying on HtmlAgilityPack to download the page for you, you can download it yourself with HttpWebRequest. The HttpWebResponse it returns has a ContentType property that you can check. This allows you to perform your check before attempting to parse the content.
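
Here is a minimal sketch of that check, written against plain System.Net so it does not depend on HtmlAgilityPack. The class name ContentTypeCheck and the helpers IsHtml and DownloadIfHtml are hypothetical names made up for illustration, and treating application/xhtml+xml as crawlable is an assumption you may not want. It also assumes the URL is http or https and that you handle the WebException that GetResponse throws for error status codes.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class ContentTypeCheck
{
    // Hypothetical helper: true when a Content-Type header value denotes an HTML page.
    // The header can carry parameters, e.g. "text/html; charset=iso-8859-9",
    // so only the part before the first ';' is compared.
    public static bool IsHtml(string contentType)
    {
        if (string.IsNullOrEmpty(contentType))
            return false;

        string mediaType = contentType.Split(';')[0].Trim();
        return mediaType.Equals("text/html", StringComparison.OrdinalIgnoreCase)
            || mediaType.Equals("application/xhtml+xml", StringComparison.OrdinalIgnoreCase);
    }

    // Hypothetical helper: returns the page source, or null when the URL is not HTML.
    // Assumes an http/https URL; GetResponse throws WebException on 4xx/5xx responses.
    public static string DownloadIfHtml(string url, Encoding encoding)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.AllowAutoRedirect = true;

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            // The headers are available here, before the body has been read.
            if (!IsHtml(response.ContentType))
                return null; // image, PDF, etc.: skip it

            using (Stream stream = response.GetResponseStream())
            using (StreamReader reader = new StreamReader(stream, encoding))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

If DownloadIfHtml returns a non-null string, you can pass it to HtmlDocument.LoadHtml and carry on with HtmlAgilityPack as before. For example, DownloadIfHtml(srPageUrl, Encoding.GetEncoding("iso-8859-9")) would keep the encoding from the question.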