I am just trying to retrieve a web page, but somehow a foreign character is embedded in the HTML file. This character is not visible when I use “View Source.”
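import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; with bs4 use: from bs4 import BeautifulSoup

isbn = 9780141187983
url = "http://search.barnesandnoble.com/booksearch/isbninquiry.asp?ean=%s" % isbn
opener = urllib2.build_opener()
url_opener = opener.open(url)
page = url_opener.read()
html = BeautifulSoup(page)
html  # This line causes the error.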
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 21555: ordinal not in range(128)
I also tried…
html = BeautifulSoup(page.encode('utf-8'))
How can I read this web page into BeautifulSoup without getting this error?
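This error is probably not happening when the page is parsed, but when you try to print the representation of the BeautifulSoup object. That happens automatically if, as I suspect, you are working in the interactive console: evaluating a bare expression such as html makes the console display it, and the text is then implicitly encoded as ASCII. The character u'\xa0' is a non-breaking space, which ASCII cannot represent, hence the UnicodeEncodeError. It is also why you do not notice the character in “View Source”: it looks like an ordinary space.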
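
A minimal sketch of the same failure, without the network request or BeautifulSoup. The string `text` is made up for illustration and only stands in for parsed page content containing a non-breaking space; it is not taken from the actual page.

```python
# Hypothetical sample text containing U+00A0 (non-breaking space),
# standing in for the content of the parsed page.
text = u"price:\xa010"

# Encoding to ASCII fails, just like the implicit conversion in the console.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding explicitly to UTF-8 succeeds and keeps the character.
utf8_bytes = text.encode("utf-8")

# Encoding to ASCII with the "replace" error handler succeeds,
# substituting "?" for anything ASCII cannot represent.
replaced = text.encode("ascii", "replace")
```

If this is indeed the cause, the page itself was read and parsed fine. Encoding the output explicitly before printing it (to UTF-8, for example) should avoid the implicit ASCII conversion.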