I keep getting the following error when trying to parse some html using BeautifulSoup:

Asked: May 23, 2026 (2026-05-23T20:16:13+00:00)

I keep getting the following error when trying to parse some html using BeautifulSoup:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)

I’ve tried every solution from the related questions listed below, including decoding the html first, but I keep getting the same error. I’m listing them so that I don’t get duplicate answers, and in case the related approaches help anyone find a solution.

Anybody know where I’m going wrong here? Is this a bug in BeautifulSoup and should I install an earlier version?

EDIT: code and traceback below:

from BeautifulSoup import BeautifulSoup as bs
soup = bs(html)

Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 1282, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 946, in __init__
    self._feed()
  File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 971, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)

Thanks for your help!

'ascii' codec error in beautifulsoup

UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)

How do I convert a file's format from Unicode to ASCII using Python?

python UnicodeEncodeError > How can I simply remove troubling unicode characters?


1 Answer

Editorial Team · Answer 1 · 2026-05-23T20:16:14+00:00

You say in a comment: "I just looked up the content-type of the html I’m trying to parse to see if it was something I hadn’t tried (earlier I just assumed it was UTF-8) but sure enough it was UTF-8 so another dead end."

Sigh. This is exactly why I have been trying to get you to divulge the HTML that you are trying to parse. The error message indicates that the (first) problem byte is \xae which is definitely NOT a valid lead byte in a UTF-8 sequence.
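To see why \xae cannot start a UTF-8 sequence, it helps to classify a byte by its high bits. The helper below is a hypothetical illustration (the name utf8_byte_role is made up, not part of any library); it follows the UTF-8 bit layout, in which bytes of the form 10xxxxxx only ever appear as continuation bytes.

```python
def utf8_byte_role(byte_value):
    """Describe what role a single byte value (0-255) can play in UTF-8."""
    if byte_value < 0x80:
        return "single-byte (ASCII) character"   # 0xxxxxxx
    if byte_value < 0xC0:
        return "continuation byte"               # 10xxxxxx, never a lead byte
    if byte_value < 0xC2:
        return "invalid (overlong encoding)"     # 0xC0 and 0xC1 are never valid
    if byte_value < 0xE0:
        return "lead byte of a 2-byte sequence"  # 110xxxxx
    if byte_value < 0xF0:
        return "lead byte of a 3-byte sequence"  # 1110xxxx
    if byte_value < 0xF5:
        return "lead byte of a 4-byte sequence"  # 11110xxx
    return "invalid (beyond U+10FFFF)"


# 0xAE is 10101110 in binary, so it falls in the continuation range.
print(utf8_byte_role(0xAE))
```

A continuation byte at position 0 of whatever was being decoded means that data was not well-formed UTF-8 on its own.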

Either divulge the link to your HTML, or do some basic debugging:

Does uc = html.decode('utf8') work or fail? If fail, with what error message?
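A minimal sketch of that check, assuming the page is held as a byte string (a plain str on Python 2.6+, bytes on Python 3). The helper name describe_decode_failure and the sample markup are made up for illustration; substitute your own html.

```python
def describe_decode_failure(raw, encoding="utf-8"):
    """Try to decode raw bytes; return (text, None) or (None, problem description)."""
    try:
        return raw.decode(encoding), None
    except UnicodeDecodeError as exc:
        bad = raw[exc.start:exc.end]  # the offending byte(s)
        return None, "%s at byte %d: %r" % (exc.reason, exc.start, bad)


# Hypothetical sample: a lone 0xAE byte inside an attribute value.
sample = b'<a title="\xae">x</a>'
text, problem = describe_decode_failure(sample)
if problem is None:
    print("decoded OK")
else:
    print("decode failed: " + problem)
```

If the decode succeeds on the real page, the input is valid UTF-8 and the failing byte must be produced somewhere inside the parser rather than read from the input.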

You also said: "I’m starting to think this is a bug in BS, which they allude to in the docs, and can be seen here: crummy.com/software/BeautifulSoup/CHANGELOG.html."

I can’t imagine which of the vague entries in the changelog you are referring to. Consider debugging your problem before you rush to update.

Update: Looks like an obscure bug in sgmllib.py. In line 394, change 255 to 127 and it appears to work. Corner case: an HTML character reference (for example &#174;, which renders as ®) in an attribute value AND with 128 <= ordinal <= 255.
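The corner case can be modeled roughly as follows. This is a simplified sketch, not the actual sgmllib code: the function names and the limit parameter are assumptions made for illustration. The idea is that a numeric reference at or below the limit is turned into a single raw byte, and that byte is then combined with unicode text, which on Python 2 triggers an implicit ASCII decode.

```python
def convert_charref(ref_number, limit):
    """Simplified model: refs in 0..limit become one raw byte, others are left alone."""
    if not 0 <= ref_number <= limit:
        return None
    return bytes(bytearray([ref_number]))


def substitute_in_unicode_attr(ref_number, limit):
    """Model substituting a numeric char ref inside a unicode attribute value."""
    raw = convert_charref(ref_number, limit)
    if raw is None:
        # Not converted: the reference stays in the text untouched.
        return "&#%d;" % ref_number
    try:
        # Python 2 performs this decode implicitly when a byte string is mixed
        # into a unicode string; it is spelled out here so the sketch also
        # runs on Python 3.
        return raw.decode("ascii")
    except UnicodeDecodeError as exc:
        return "UnicodeDecodeError: %s" % exc


print(substitute_in_unicode_attr(174, 255))  # limit 255: the raw byte 0xae is produced
print(substitute_in_unicode_attr(174, 127))  # limit 127: the reference is left as is
```

With the limit at 127, only references that are safe to decode as ASCII get converted, which is why the one-number change avoids the error.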

Further comments: Rather than hack your copy of sgmllib.py, grab a copy of the latest sgmllib.py from the 2.7 branch; BS 3.0.4 ran OK for me on Python 2.7.1. Even better, upgrade your Python to 2.7.
