I am fetching a webpage (http://autoweek.com) and trying to process it, but I am getting an encoding error. Autoweek declares “iso-8859-1” encoding and contains the word “Nürburgring” (u with umlaut).
I do:
# -*- encoding: utf-8 -*-
import urllib
webpage = urllib.urlopen(feed.crawl_url).read()
webpage.decode("utf-8")
It gives me the following error:
'utf8' codec can't decode bytes in position 7768-7773: unsupported Unicode code range
If I bypass the .decode step and do some parsing with the lxml library, it raises an error when I save the parsed title to the database:
'utf8' codec can't decode bytes in position 45-50: unsupported Unicode code range
My database has character set utf8 and collation utf8_general_ci.
My settings:
Django
Python 2.4.3
MySQL 5.0.22
MySQL-python 1.2.1
mod_python 3.2.8
autoweek.com seems confused about its own encoding. It declares conflicting charset definitions:
and later…
iso-8859-1 is the correct one, since it is what the web server returns in the header and what the .info() method reports (and the page actually decodes with it). This demonstrates that you can’t necessarily rely on the Content-Type declaration inside a web page. You should follow the method described by lavinio.
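As a rough illustration of trusting the HTTP header rather than the in-page declaration, here is a minimal sketch. The helper names (charset_from_content_type, decode_body) are made up for this example, and the iso-8859-1 fallback is an assumption for responses that name no charset. To stay self-contained it uses sample bytes instead of a live request; in the question's code the raw bytes would come from urlopen(...).read() and the header value from the object returned by .info(). The b"" literal needs Python 2.6 or later.

```python
import re

def charset_from_content_type(content_type, default="iso-8859-1"):
    """Return the charset named in a Content-Type header value, or the default."""
    match = re.search(r"charset\s*=\s*[\"']?([\w.:-]+)", content_type or "", re.I)
    if match:
        return match.group(1).lower()
    return default

def decode_body(raw, content_type, default="iso-8859-1"):
    """Decode raw response bytes with the charset taken from the HTTP header."""
    return raw.decode(charset_from_content_type(content_type, default))

# Sample bytes standing in for the fetched page: in ISO-8859-1 the
# character u-umlaut is the single byte 0xFC, which is not valid UTF-8 here.
raw = b"N\xfcrburgring"
text = decode_body(raw, "text/html; charset=iso-8859-1")

# Once decoded to a unicode string, the text can be re-encoded as UTF-8
# before it is handed to a utf8 database.
utf8_bytes = text.encode("utf-8")
```

Decoding once at the boundary and working with unicode strings afterwards should also avoid the second error, since the title passed to the database layer is no longer a raw iso-8859-1 byte string.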