I’m trying to parse download pages from http://www.mediafire.com, but I very often get a System.Net.WebException with the following message when I try to load a page into an HtmlDocument:
The server committed a protocol violation. Section=ResponseStatusLine
This is my code:
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = null;
string url = "http://www.mediafire.com/?abcdefghijkl"; // There are many different links
try
{
    doc = web.Load(url); // Of 30 links, usually only 10 load properly
}
catch (WebException)
{
}
Any ideas why only 10 of 30 links work (the links change every time, because my program is a “search engine”) and how I can resolve the problem?
When I load those sites in my browser, everything works fine.
I’ve tried adding the following lines to my app.config, but that doesn’t help either:
<system.net>
  <settings>
    <httpWebRequest useUnsafeHeaderParsing="true" />
  </settings>
</system.net>
This is not related to the Html Agility Pack directly, but rather to the underlying HTTP/socket layer. This error means the server is not sending back a correct HTTP status line.
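The status line is defined in the HTTP RFC (RFC 2616, section 6.1), available here: http://www.w3.org/Protocols/rfc2616/rfc2616-sec6.html
I quote:
"The first line of a Response message is the Status-Line, consisting of the protocol version followed by a numeric status code and its associated textual phrase, with each element separated by SP characters. No CR or LF is allowed except in the final CRLF sequence."
Status-Line = HTTP-Version SP Status-Code SP Reason-Phrase CRLF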
You can add socket traces with full hex report to check this:
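As a sketch, a System.Net.Sockets trace section in app.config typically looks like the following. The listener name SocketTrace is arbitrary, and the section is assumed to sit inside the root configuration element next to the system.net section from the question:

```xml
<system.diagnostics>
  <sources>
    <!-- tracemode="includehex" dumps the raw bytes exchanged on the socket -->
    <source name="System.Net.Sockets" tracemode="includehex">
      <listeners>
        <add name="SocketTrace" />
      </listeners>
    </source>
  </sources>
  <sharedListeners>
    <!-- a relative file name is resolved against the current directory -->
    <add name="SocketTrace"
         type="System.Diagnostics.TextWriterTraceListener"
         initializeData="SocketTrace.log" />
  </sharedListeners>
  <switches>
    <add name="System.Net.Sockets" value="Verbose" />
  </switches>
  <trace autoflush="true" />
</system.diagnostics>
```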
This will create a SocketTrace.log file in the current executing directory. Have a look in there, the protocol violation should be visible. You can post it here if it’s not too big 🙂
Unfortunately, if you don’t own the server, there is not much you can do (if you already added the useUnsafeHeaderParsing setting, which is good) but fail gracefully in these cases.
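One way to fail gracefully is to retry a few times and give up quietly only when the failure is a protocol violation. The following is a minimal sketch, not part of the Html Agility Pack: SafeLoader, TryLoad and maxAttempts are hypothetical names, and it assumes a retry is worthwhile because the same server answers correctly on other requests.

```csharp
using System;
using System.Net;

static class SafeLoader
{
    // Calls load(url) up to maxAttempts times. Returns null if every attempt
    // fails with a protocol violation; any other WebException is rethrown.
    public static T TryLoad<T>(Func<string, T> load, string url, int maxAttempts)
        where T : class
    {
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                return load(url);
            }
            catch (WebException ex)
            {
                if (ex.Status != WebExceptionStatus.ServerProtocolViolation)
                {
                    throw; // not the error discussed here, let the caller see it
                }
                Console.Error.WriteLine(
                    "Protocol violation on attempt {0} for {1}", attempt, url);
            }
        }
        return null; // caller should skip this link
    }
}
```

With the code from the question, this could be called as doc = SafeLoader.TryLoad(u => web.Load(u), url, 3); and the link skipped when doc is null.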