I am trying to save page_source to a file. However, every time I get an error like this (the offending byte varies):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 8304: ordinal not in range(128)
I’ve tried value.encode('utf-8'), and I’ve also tried manually replacing every non-ASCII character, but the same exception is thrown every time. Is there a way to ‘pre-process’ the HTML to put it into a ‘writable’ format?
There are third-party libraries such as BeautifulSoup and lxml that can deal with encoding issues automatically. But here’s a crude example using just urllib2.
First, download a webpage containing non-ASCII characters:
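A minimal sketch of the download step. The helper name `download` and the URL in the usage comment are placeholders, not part of the original answer; the import falls back to `urllib.request` so the same code also runs on Python 3, where `urllib2` no longer exists.

```python
try:
    from urllib2 import urlopen          # Python 2, as used in this answer
except ImportError:
    from urllib.request import urlopen   # Python 3 equivalent


def download(url):
    """Return the raw, still-encoded bytes of the response body."""
    response = urlopen(url)
    try:
        return response.read()
    finally:
        response.close()


# Usage (hypothetical URL; substitute the page you are fetching):
#   raw = download("http://example.com/")
```

The result is a byte string, not text. Calling `.encode('utf-8')` on it in Python 2 first triggers an implicit ASCII decode, which is where the UnicodeDecodeError comes from.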
Now look for the “charset” near the top of the page:
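One way to do this programmatically is a regular expression over the first part of the raw bytes. This is only a rough sketch: `find_charset` and the 2048-byte window are illustrative choices, and a real HTML parser is more robust than a regex.

```python
import re

# Matches both <meta charset="utf-8"> and
# <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
_CHARSET_RE = re.compile(br"""charset\s*=\s*["']?([\w.:-]+)""", re.IGNORECASE)


def find_charset(raw, window=2048):
    """Return the declared charset name, or None if none is found."""
    match = _CHARSET_RE.search(raw[:window])
    if match is None:
        return None
    # Charset names are plain ASCII, so this decode is safe.
    return match.group(1).decode("ascii")


# Usage:
#   find_charset(b'<meta charset="utf-8">')
```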
If there is no obvious charset, “UTF-8” is usually a good guess anyway.
Finally, convert the webpage to unicode text:
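A sketch of the conversion, followed by writing the result to disk. The names `to_unicode` and `save_text` are hypothetical helpers; `charset` is whatever the page declared, with UTF-8 as the fallback guess.

```python
import io


def to_unicode(raw, charset=None):
    """Decode raw page bytes into unicode text, guessing UTF-8 if no charset is given."""
    return raw.decode(charset or "utf-8")


def save_text(text, path):
    """Write unicode text to a file with an explicit encoding."""
    # io.open accepts an encoding on both Python 2 and 3, so the write
    # never falls back to the implicit ASCII codec.
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text)


# Usage (raw is the byte string returned by the download step):
#   text = to_unicode(raw, "ISO-8859-1")
#   save_text(text, "page.html")
```

If the guessed charset is wrong, `decode` may still raise a UnicodeDecodeError; passing `"replace"` as the second argument to `decode` is one way to substitute undecodable bytes instead of failing.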