Some web servers return content-length set to zero in the HTTP response headers. I’d like a deterministic and performant solution for receiving all the data in that situation.
URL known to exhibit this behavior (additional URLs below):
http://www.washingtonpost.com/wp-dyn/content/article/2010/02/12/AR2010021204894.html?hpid=topnews
headers:
Cache-control:no-cache
Connection:close
Content-Encoding:gzip
Content-type:text/html
Server:Web Server
Transfer-encoding:chunked
My current solution is not guaranteed to get all the data because of the MaxTries constant, and it is slow because of Thread.Sleep():
    private bool MoreDataIsAvailable()
    {
        int avail = _socket.Available;
        if (avail == 0 &&
            _contentLength != null && _contentLength == 0)
        {
            int tries = 0;
            while (avail == 0 && tries < MaxTries)
            {
                Thread.Sleep(5);
                _socket.Poll(1000, SelectMode.SelectRead);
                avail = _socket.Available;
                tries++;
                if (avail > 0)
                {
                    Console.WriteLine(_socket.Handle + " avail = " + avail + " received = " + _bytes.Length + " && tries = " + tries);
                }
            }
        }
        return avail > 0;
    }
Usage in context:
    private void ReceiveCallback(object sender, SocketAsyncEventArgs e)
    {
        if (ConnectionWasClosed(e) || HadSocketError(e))
        {
            _receiveDone.Set();
            return;
        }
        StoreReceivedBytes(e);
        if (AllBytesReceived())
        {
            _receiveDone.Set();
            return;
        }
        if (MoreDataIsExpected() || MoreDataIsAvailable())
        {
            WaitForBytes(e);
        }
        else
        {
            _receiveDone.Set();
        }
    }
Sample output:
1436 avail = 3752 received = 1704 && tries = 9
1436 avail = 3752 received = 9208 && tries = 8
1436 avail = 3752 received = 12960 && tries = 9
1436 avail = 3752 received = 20464 && tries = 8
1436 avail = 3752 received = 27968 && tries = 7
1436 avail = 7504 received = 31720 && tries = 1
1436 avail = 3752 received = 39224 && tries = 6
Edit:
Nikolai observed that responses with a Transfer-encoding: chunked header need special handling but their ends can be detected deterministically.
Excluding the chunked responses, however, there are still other URLs that end up in my catch-all method, examples:
http://www.biomedcentral.com/1471-2105/6/197
headers:
Cache-control:private
Connection:close
Content-Type:text/html
P3P:policyref="/w3c/p3p.xml", CP="NOI DSP COR CURa ADMa DEVa TAIa OUR BUS PHY ONL UNI COM NAV INT DEM PRE"
Server:Microsoft-IIS/5.0
X-Powered-By:ASP.NET
http://slampp.abangadek.com/info/
headers:
Connection:close
Content-Type:text/html
Server:Apache/2.2.8 (Ubuntu) DAV/2 PHP/5.2.4-2ubuntu5.3 with Suhosin-Patch mod_ruby/1.2.6 Ruby/1.8.6(2007-09-24) mod_ssl/2.2.8 OpenSSL/0.9.8g
X-Cache:MISS from server03.abangadek.com
X-Powered-By:PHP/5.2.4-2ubuntu5.3
http://video.forbes.com/embedvideo/?format=frame&height=515&width=336&mode=render&networklink=1
headers:
Connection:close
Content-Language:en-US
Content-Type:text/html;charset=ISO-8859-1
Server:Apache-Coyote/1.1
I would like to know what I can look for in these responses that, as the Transfer-Encoding header did for the first URL, tells me how to read the entire response deterministically, so that the call to my method can be avoided.
From the URL given, it seems you are looking at HTTP chunked transfer encoding, which allows the server to start transmitting the response before the total length is known while still allowing the client to reliably determine the end of the response.
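To illustrate why the end is detectable: each chunk is prefixed with its size in hexadecimal, and a chunk of size zero marks the end of the body. The following is only a minimal sketch, not a full implementation. The class and method names (ChunkedBody, TryDecode) are made up for this example, it assumes the headers have already been stripped from the buffer, and it ignores trailer headers and the Content-Encoding: gzip layer.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the code in the question.
public static class ChunkedBody
{
    // Decodes a chunked message body (headers already removed).
    // Returns null while the terminating zero-size chunk has not arrived,
    // which is the signal to keep receiving.
    public static byte[] TryDecode(byte[] raw)
    {
        var body = new MemoryStream();
        int pos = 0;
        while (true)
        {
            int lineEnd = IndexOfCrLf(raw, pos);
            if (lineEnd < 0) return null; // chunk-size line is incomplete

            // The size line is hex digits, optionally followed by ";extension".
            string sizeLine = Encoding.ASCII.GetString(raw, pos, lineEnd - pos);
            int semi = sizeLine.IndexOf(';');
            if (semi >= 0) sizeLine = sizeLine.Substring(0, semi);
            int size = int.Parse(sizeLine.Trim(), NumberStyles.HexNumber,
                                 CultureInfo.InvariantCulture);
            pos = lineEnd + 2;

            // Zero-size chunk: the body is complete (any trailers are ignored here).
            if (size == 0) return body.ToArray();

            // Chunk data is followed by its own CRLF; wait until both are present.
            if (pos + size + 2 > raw.Length) return null;
            body.Write(raw, pos, size);
            pos += size + 2;
        }
    }

    private static int IndexOfCrLf(byte[] data, int start)
    {
        for (int i = start; i + 1 < data.Length; i++)
        {
            if (data[i] == (byte)'\r' && data[i + 1] == (byte)'\n') return i;
        }
        return -1;
    }
}
```

With something along these lines, a check such as AllBytesReceived() could return true as soon as TryDecode stops returning null, with no polling or sleeping. Re-parsing the whole buffer on every receive is wasteful for large responses; an incremental parser would be the natural refinement.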
Also see RFC 2616, section 3.6.1.
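The other URLs in the question, which send neither Transfer-Encoding nor Content-Length, fall under section 4.4 of the same RFC: a Transfer-Encoding other than "identity" means the body is chunked, otherwise Content-Length gives the size, and if neither is present the body ends when the server closes the connection (note the Connection: close header in all of those examples). A rough sketch of that decision follows. The names BodyFraming and HttpFraming.Select are invented for illustration, and the special cases for responses that never carry a body (HEAD, 1xx, 204, 304) and for multipart/byteranges are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical types, not part of the code in the question.
public enum BodyFraming { Chunked, ContentLength, UntilClose }

public static class HttpFraming
{
    // Picks how the end of the response body is detected, following the
    // order of precedence in RFC 2616 section 4.4.
    // 'length' is only meaningful when ContentLength is returned.
    public static BodyFraming Select(
        IEnumerable<KeyValuePair<string, string>> headers, out long length)
    {
        length = -1;

        string te = Find(headers, "Transfer-Encoding");
        if (te != null && !te.Equals("identity", StringComparison.OrdinalIgnoreCase))
        {
            return BodyFraming.Chunked;
        }

        string cl = Find(headers, "Content-Length");
        long parsed;
        if (cl != null &&
            long.TryParse(cl, NumberStyles.None, CultureInfo.InvariantCulture, out parsed))
        {
            length = parsed;
            return BodyFraming.ContentLength;
        }

        // Neither header is usable: the server signals the end by closing.
        return BodyFraming.UntilClose;
    }

    // Header names are case-insensitive, so compare them that way.
    private static string Find(
        IEnumerable<KeyValuePair<string, string>> headers, string name)
    {
        foreach (KeyValuePair<string, string> h in headers)
        {
            if (string.Equals(h.Key, name, StringComparison.OrdinalIgnoreCase))
            {
                return h.Value.Trim();
            }
        }
        return null;
    }
}
```

In the UntilClose case there is nothing to poll for: keep posting receives until one completes with zero bytes transferred, which is how a graceful close by the remote side shows up on a socket. In the question's ReceiveCallback that presumably corresponds to the ConnectionWasClosed(e) branch, so the catch-all MoreDataIsAvailable() would not be needed there.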