I have been having a persistent problem getting an RSS feed from a particular website. I wound up writing a rather ugly procedure to perform this function, but I am curious why this happens and whether any higher-level interfaces handle this problem properly. This problem isn’t really a showstopper, since I don’t need to retrieve the feed very often.
I have read a solution that traps the exception and returns the partial content, yet since the incomplete reads differ in the number of bytes actually retrieved, I have no certainty that such a solution will actually work.
#!/usr/bin/env python
import os
import sys
import feedparser
from mechanize import Browser
import requests
import urllib2
from httplib import IncompleteRead
url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
content = feedparser.parse(url)
if 'bozo_exception' in content:
    print content['bozo_exception']
else:
    print "Success!!"
    sys.exit(0)
print "If you see this, please tell me what happened."
# try using mechanize
b = Browser()
r = b.open(url)
try:
    r.read()
except IncompleteRead, e:
    print "IncompleteRead using mechanize", e
# try using urllib2
r = urllib2.urlopen(url)
try:
    r.read()
except IncompleteRead, e:
    print "IncompleteRead using urllib2", e
# try using requests
try:
    r = requests.request('GET', url)
except IncompleteRead, e:
    print "IncompleteRead using requests", e
# this function is old and I categorized it as ...
# "at least it works darnnit!", but I would really like to
# learn what's happening. Please help me put this function into
# eternal rest.
def get_rss_feed(url):
    response = urllib2.urlopen(url)
    read_it = True
    content = ''
    # read one byte at a time until httplib raises IncompleteRead
    while read_it:
        try:
            content += response.read(1)
        except IncompleteRead:
            read_it = False
    return content, response.info()
content, info = get_rss_feed(url)
feed = feedparser.parse(content)
As already stated, this isn’t a mission-critical problem, just a curiosity: I can expect urllib2 to have this problem, but I am surprised that the same error is encountered in mechanize and requests as well. The feedparser module doesn’t even throw an error, so checking for errors depends on the presence of a ‘bozo_exception’ key.
Edit: I just wanted to mention that both wget and curl perform the function flawlessly, retrieving the full payload correctly every time. I have yet to find a pure Python method that works, other than my ugly hack, and I am very curious to know what is happening on the backend of httplib. On a lark, I decided to also try this with twill the other day and got the same httplib error.
P.S. There is one thing that also strikes me as very odd. The IncompleteRead happens consistently at one of two breakpoints in the payload. It seems that feedparser and requests fail after reading 926 bytes, yet mechanize and urllib2 fail after reading 1854 bytes. This behavior is consistent, and I am left without explanation or understanding.
At the end of the day, all of the other modules (feedparser, mechanize, and urllib2) call httplib, which is where the exception is being thrown.
Now, first things first, I also downloaded this with wget and the resulting file was 1854 bytes. Next, I tried with urllib2: it reads all 1854 bytes but then thinks there is more to come. If we explicitly tell it to read only 1854 bytes, it works.
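As a rough illustration of the exact-length read, here is a minimal sketch. It is written for Python 3, where httplib is named http.client, and it does not contact the real feed. Instead it uses a hypothetical local stand-in server (serve_truncated, my own helper) that declares a Content-Length of 2000 but sends only 1854 bytes. That mismatch is an assumption chosen to reproduce the symptom, not a diagnosis of what the real server does.

```python
import http.client
import socket
import threading


def serve_truncated(body, declared_length):
    """Hypothetical one-shot local server that declares more bytes than it sends."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    srv.listen(1)
    port = srv.getsockname()[1]

    def handle():
        conn, _ = srv.accept()
        request = b""
        while b"\r\n\r\n" not in request:  # consume the request headers
            chunk = conn.recv(4096)
            if not chunk:
                break
            request += chunk
        head = "HTTP/1.1 200 OK\r\nContent-Length: %d\r\n\r\n" % declared_length
        conn.sendall(head.encode("ascii") + body)
        conn.close()  # close before the declared length is reached
        srv.close()

    threading.Thread(target=handle, daemon=True).start()
    return port


# Assumed numbers: 1854 real bytes, 2000 declared.
port = serve_truncated(b"x" * 1854, declared_length=2000)
conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
conn.request("GET", "/")
response = conn.getresponse()
data = response.read(1854)  # ask only for the bytes that will actually arrive
print(len(data))
```

Because the read stops at the number of bytes that really arrive, the client never waits for the remainder that the header promised.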
Obviously, this is only useful if we always know the exact length ahead of time. Instead, we can use the fact that the partial read is returned as an attribute on the exception (partial) to capture the entire contents.
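A minimal sketch of that idea follows, again in Python 3 (http.client is the renamed httplib) and again against a hypothetical local server that overstates its Content-Length. The helper names serve_truncated and read_all are mine, and the byte counts are assumed values picked to match the numbers in this thread.

```python
import http.client
import socket
import threading


def serve_truncated(body, declared_length):
    """Hypothetical one-shot local server that declares more bytes than it sends."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    port = srv.getsockname()[1]

    def handle():
        conn, _ = srv.accept()
        request = b""
        while b"\r\n\r\n" not in request:  # consume the request headers
            chunk = conn.recv(4096)
            if not chunk:
                break
            request += chunk
        head = "HTTP/1.1 200 OK\r\nContent-Length: %d\r\n\r\n" % declared_length
        conn.sendall(head.encode("ascii") + body)
        conn.close()
        srv.close()

    threading.Thread(target=handle, daemon=True).start()
    return port


def read_all(response):
    """Return the full body, falling back to the bytes received before the error."""
    try:
        return response.read()
    except http.client.IncompleteRead as e:
        return e.partial  # whatever was read before the connection ended


port = serve_truncated(b"x" * 1854, declared_length=2000)
conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
conn.request("GET", "/")
data = read_all(conn.getresponse())
print(len(data))
```

Note that this cannot tell a server that miscounts from a transfer that was genuinely cut short, so the result is only as trustworthy as the assumption that the server sent everything it had.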
This blog post suggests this is a fault of the server, and describes how to monkey-patch the httplib.HTTPResponse.read() method with a try..except block that returns the partial read, so that things are handled behind the scenes. I applied the patch and then feedparser worked.
This isn’t the nicest way of doing things, but it seems to work. I’m not expert enough in the HTTP protocols to say for sure whether the server is doing things wrong, or whether httplib is mis-handling an edge case.
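For reference, a monkey-patch along those lines might look like the sketch below. It is my own reconstruction rather than the blog post's code, it targets Python 3's http.client (the renamed httplib), and it is exercised against a hypothetical local server that overstates its Content-Length instead of the real feed. The names patch_http_response_read and serve_truncated are illustrative.

```python
import functools
import http.client
import socket
import threading


def patch_http_response_read(func):
    """Wrap HTTPResponse.read so IncompleteRead yields the partial body instead."""
    @functools.wraps(func)
    def inner(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except http.client.IncompleteRead as e:
            return e.partial
    return inner


# Global patch: every caller of HTTPResponse.read in this process is affected.
http.client.HTTPResponse.read = patch_http_response_read(
    http.client.HTTPResponse.read
)


def serve_truncated(body, declared_length):
    """Hypothetical one-shot local server that declares more bytes than it sends."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    port = srv.getsockname()[1]

    def handle():
        conn, _ = srv.accept()
        request = b""
        while b"\r\n\r\n" not in request:  # consume the request headers
            chunk = conn.recv(4096)
            if not chunk:
                break
            request += chunk
        head = "HTTP/1.1 200 OK\r\nContent-Length: %d\r\n\r\n" % declared_length
        conn.sendall(head.encode("ascii") + body)
        conn.close()
        srv.close()

    threading.Thread(target=handle, daemon=True).start()
    return port


port = serve_truncated(b"x" * 1854, declared_length=2000)
conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
conn.request("GET", "/")
feed_bytes = conn.getresponse().read()  # no exception after the patch
print(len(feed_bytes))
```

Since the patch is process-wide, it also hides truncation from every other piece of code that reads through this method, which is the main reason to treat it as a workaround rather than a fix.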