When I check the ‘content-length’ header for some web pages using urllib2 in Python, the header is missing. For example, the response from google.com does not include it. Any idea why?
Example:
import urllib2

r = urllib2.urlopen('http://www.google.com')
i = r.info()  # the response headers
print i.keys()
Gives:
['x-xss-protection', 'set-cookie', 'expires', 'server', 'connection', 'cache-control', 'date', 'p3p', 'content-type', 'x-frame-options']
An HTTP response can contain either Content-Length or Transfer-Encoding: chunked.
However, when Transfer-Encoding: chunked is used, the body that follows the headers starts with a hexadecimal string which, converted to decimal, gives the length of the next chunk. After the last chunk you get a 0 for this value, which means you have reached the end of the file.
You can use regular expressions to get this hexadecimal value (not a must, though), or you can just read the first hexadecimal value, get the length of the first chunk and receive that chunk, then get the length of the next chunk, and so on until you find a 0.
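The read-a-size, read-a-chunk loop described above can be sketched roughly as follows. This is only an illustration: decode_chunked is a made-up helper name, the input is a hand-written sample body rather than a real response, and trailers after the final chunk are ignored. As far as I know, urllib2 (through httplib) already undoes chunked encoding for you, so in practice len(r.read()) should give the body size without any of this.

```python
import io


def decode_chunked(stream):
    """Decode a chunked HTTP body from a binary file-like object.

    Hypothetical helper for illustration; not part of urllib2.
    """
    body = []
    while True:
        size_line = stream.readline()
        if not size_line:
            raise ValueError("stream ended before the terminating 0 chunk")
        # The size is hexadecimal; anything after ';' is a chunk extension.
        size_text = size_line.split(b";", 1)[0].strip().decode("ascii")
        size = int(size_text, 16)
        if size == 0:
            break  # a size of 0 marks the end of the body
        chunk = stream.read(size)
        if len(chunk) != size:
            raise ValueError("chunk was shorter than its declared size")
        body.append(chunk)
        stream.readline()  # consume the CRLF that follows the chunk data
    return b"".join(body)


# Hand-written sample: two chunks ("Wiki", "pedia"), then the 0 terminator.
raw = b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"
decoded = decode_chunked(io.BytesIO(raw))
print(len(decoded))
```

The total length is simply the sum of the chunk sizes, which is why the server does not need to send Content-Length up front.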