I’m working on something that pulls in URLs from Delicious and then uses those URLs to discover associated feeds.
However, some of the bookmarks in Delicious are not HTML links and cause BeautifulSoup to barf. Basically, I want to throw away a link if, once fetched, it doesn’t look like HTML.
Right now, this is what I’m getting:

    trillian:Documents jauderho$ ./d2o.py 'green data center'
    processing http://www.greenm3.com/
    processing http://www.eweek.com/c/a/Green-IT/How-to-Create-an-EnergyEfficient-Green-Data-Center/?kc=rss
    Traceback (most recent call last):
      File './d2o.py', line 53, in <module>
        get_feed_links(d_links)
      File './d2o.py', line 43, in get_feed_links
        soup = BeautifulSoup(html)
      File '/Library/Python/2.5/site-packages/BeautifulSoup.py', line 1499, in __init__
        BeautifulStoneSoup.__init__(self, *args, **kwargs)
      File '/Library/Python/2.5/site-packages/BeautifulSoup.py', line 1230, in __init__
        self._feed(isHTML=isHTML)
      File '/Library/Python/2.5/site-packages/BeautifulSoup.py', line 1263, in _feed
        self.builder.feed(markup)
      File '/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py', line 108, in feed
        self.goahead(0)
      File '/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py', line 150, in goahead
        k = self.parse_endtag(i)
      File '/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py', line 314, in parse_endtag
        self.error('bad end tag: %r' % (rawdata[i:j],))
      File '/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py', line 115, in error
        raise HTMLParseError(message, self.getpos())
    HTMLParser.HTMLParseError: bad end tag: u'</b />', at line 739, column 1
Update:
Jehiah’s answer does the trick. For reference, here’s some code to get the content type:

    import urllib

    def check_for_html(link):
        out = urllib.urlopen(link)
        return out.info().getheader('Content-Type')
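On the caller’s side, that header value can be filtered before parsing. Here’s a minimal sketch of such a filter; the `is_html` helper and its accepted media-type list are my own additions, not part of the original script:

```python
def is_html(content_type):
    """Return True if a Content-Type header value looks parseable as HTML.

    The header may carry parameters (e.g. 'text/html; charset=utf-8'),
    so only the media type itself is compared. The accepted list below
    is an assumption; extend it to taste.
    """
    if not content_type:
        return False  # no Content-Type header at all: safest to skip
    media_type = content_type.split(';')[0].strip().lower()
    return media_type in ('text/html', 'application/xhtml+xml')
```

In the crawl loop this becomes a simple guard: skip the link whenever `is_html(check_for_html(link))` is false, and only then hand the body to BeautifulSoup.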
I simply wrap my BeautifulSoup processing and look for the HTMLParser.HTMLParseError exception. But beyond that, you can check the content type of the response when you crawl a page and make sure that it’s something like text/html or application/xhtml+xml before you even try to parse it. That should head off most errors.
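The wrap-and-skip pattern described above can be sketched generically. The `safe_parse` name is my own; in the original script `parser` would be `BeautifulSoup` and `errors` would be `HTMLParser.HTMLParseError` (which no longer exists in modern Python, so a toy parser stands in here):

```python
def safe_parse(html, parser, errors):
    """Try to parse `html`; return None instead of raising on parse errors.

    `parser` is the parsing callable and `errors` is the exception type
    (or tuple of types) to swallow. Any other exception still propagates,
    so genuine bugs aren't silently hidden.
    """
    try:
        return parser(html)
    except errors:
        return None  # not parseable markup: caller should skip this link
```

A result of `None` then means “throw this link away,” which is exactly the behavior the question asks for.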