I’m using Python on Google App Engine to fetch HTML pages and display them. My aim is to be able to fetch any page in any language, but I have a problem with encoding:
Simply calling
result = urllib2.urlopen(url).read()
leaves artifacts in place of special characters, and
urllib2.urlopen(url).read().decode('utf8')
throws an error:
'utf8' codec can't decode bytes in position 3544-3546: invalid data
How can I solve this? Is there a library that can detect which encoding
a page uses and convert it so that the text is readable?
rajax suggested, in "How to download any(!) webpage with correct charset in python?", using the chardet library from http://chardet.feedparser.org/
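For context, here is a minimal sketch of how that kind of detection could be wired up. It is not necessarily what the linked answer does: `decode_page`, its parameters, the order of the attempts, and the fallback list are all assumptions made for illustration. The only chardet call used is `chardet.detect`, and the sketch skips it if the library is not installed.

```python
import re


def decode_page(raw, content_type=None, fallbacks=("utf-8", "cp1252")):
    """Decode raw page bytes to text (hypothetical helper, not a library API)."""
    candidates = []

    # 1. Charset declared in the HTTP Content-Type header, if one was given.
    if content_type:
        match = re.search(r'charset=["\']?([\w.:-]+)', content_type, re.I)
        if match:
            candidates.append(match.group(1))

    # 2. Statistical guess from chardet, only if the library is available.
    try:
        import chardet
        guess = chardet.detect(raw)["encoding"]
        if guess:
            candidates.append(guess)
    except ImportError:
        pass

    # 3. Common fallbacks (this particular list is an assumption).
    candidates.extend(fallbacks)

    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown encoding name: try the next candidate.
            continue

    # Last resort: keep what can be decoded and mark the rest as unknown.
    return raw.decode("utf-8", "replace")
```

With urllib2 this might be called as `decode_page(response.read(), response.headers.get('Content-Type'))`, where `response` is the object returned by `urllib2.urlopen(url)`. A detector only guesses, so a charset declared by the server is tried first here.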
This code seems to work now: