I’m working on something that pulls in urls from delicious and then uses those

Question


Asked: May 11, 2026 (2026-05-11T04:09:37+00:00)


I’m working on something that pulls in URLs from Delicious and then uses those URLs to discover associated feeds.

However, some of the bookmarks in Delicious are not HTML links and cause BeautifulSoup (BS) to barf. Basically, I want to throw away a link if the content fetched for it does not look like HTML.

Right now, this is what I’m getting.

    trillian:Documents jauderho$ ./d2o.py 'green data center'
    processing http://www.greenm3.com/
    processing http://www.eweek.com/c/a/Green-IT/How-to-Create-an-EnergyEfficient-Green-Data-Center/?kc=rss
    Traceback (most recent call last):
      File './d2o.py', line 53, in <module>
        get_feed_links(d_links)
      File './d2o.py', line 43, in get_feed_links
        soup = BeautifulSoup(html)
      File '/Library/Python/2.5/site-packages/BeautifulSoup.py', line 1499, in __init__
        BeautifulStoneSoup.__init__(self, *args, **kwargs)
      File '/Library/Python/2.5/site-packages/BeautifulSoup.py', line 1230, in __init__
        self._feed(isHTML=isHTML)
      File '/Library/Python/2.5/site-packages/BeautifulSoup.py', line 1263, in _feed
        self.builder.feed(markup)
      File '/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py', line 108, in feed
        self.goahead(0)
      File '/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py', line 150, in goahead
        k = self.parse_endtag(i)
      File '/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py', line 314, in parse_endtag
        self.error('bad end tag: %r' % (rawdata[i:j],))
      File '/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py', line 115, in error
        raise HTMLParseError(message, self.getpos())
    HTMLParser.HTMLParseError: bad end tag: u'</b  />', at line 739, column 1

Update:

Jehiah’s answer does the trick. For reference, here’s some code to get the content type:

    import urllib

    def check_for_html(link):
        # Fetch the link and return its Content-Type header (Python 2 urllib).
        out = urllib.urlopen(link)
        return out.info().getheader('Content-Type')


1 Answer

score 0 · Answer 1 · 2026-05-11T04:09:37+00:00

I simply wrap my BeautifulSoup processing in a try/except and catch the HTMLParser.HTMLParseError exception:

    import HTMLParser
    import BeautifulSoup

    try:
        soup = BeautifulSoup.BeautifulSoup(raw_html)
        for a in soup.findAll('a'):
            # Index the tag directly to read an attribute.
            href = a['href']
            ....
    except HTMLParser.HTMLParseError:
        print 'failed to parse', url

Beyond that, you can check the content type of each response when you crawl a page and make sure it is something like text/html or application/xhtml+xml before you even try to parse it. That should head off most errors.
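As a rough illustration of that content-type check, here is a minimal sketch. It is written for Python 3's `urllib.request` rather than the Python 2.5 `urllib` used elsewhere in this thread, and the names `HTML_TYPES`, `looks_like_html`, and `fetch_html_or_none` are made up for this example. The set of accepted types is also an assumption; adjust it to whatever you consider parseable.

```python
import urllib.request

# Media types treated as "HTML-like". This set is an assumption for the
# sketch, not an exhaustive list.
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def looks_like_html(content_type_header):
    """Return True if a raw Content-Type header value names an HTML-like type."""
    if not content_type_header:
        return False
    # Drop parameters such as "; charset=UTF-8" and normalise the case.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in HTML_TYPES


def fetch_html_or_none(url, timeout=10):
    """Fetch url and return the body as bytes, or None if it is not HTML-like."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        if not looks_like_html(response.headers.get("Content-Type")):
            return None
        return response.read()
```

A crawler loop could call `fetch_html_or_none(link)` and skip the link when it returns `None`. Servers sometimes send a wrong Content-Type, so it is probably still worth keeping the try/except around the parse as a second line of defence.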
