I am trying to download page_source to a file. However, every time I get

Asked: May 28, 2026 (2026-05-28T00:54:28+00:00)

I am trying to download page_source to a file. However, every time I get a:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 (or something else) in 
position 8304: ordinal not in range(128)

I’ve tried using value.encode('utf-8'), but it throws the same exception every time, and so does manually replacing every non-ASCII character. Is there a way to ‘pre-process’ the HTML to put it into a ‘writable’ format?

1 Answer

Editorial Team · Answer 1 · 2026-05-28T00:54:29+00:00

There are third-party libraries such as BeautifulSoup and lxml that can deal with encoding issues automatically. But here’s a crude example using just urllib2:

First download some webpage containing non-ascii characters:

>>> import urllib2
>>> response = urllib2.urlopen('http://www.ltg.ed.ac.uk/~richard/unicode-sample.html')
>>> data = response.read()

Now have a look for the “charset” at the top of the page:

>>> data[:200]
'<html>\n<head>\n<title>Unicode 2.0 test page</title>\n<meta
content="text/html; charset=UTF-8" http-equiv="Content-type"/>\n
</head>\n<body>\n<p>This page contains characters from each of the
Unicode\ncharact'

If there is no obvious charset, “UTF-8” is usually a good guess anyway.
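
Looking for the charset by eye can also be done in code. The following is only a rough sketch: the helper name `sniff_charset` and the 1024-byte search window are assumptions made for illustration, not anything urllib2 provides, and a real page may declare its encoding in an HTTP header instead.

```python
import re

def sniff_charset(data, default='utf-8'):
    """Guess the encoding declared near the top of an HTML byte string."""
    # Only search the first 1024 bytes, where a <meta> charset usually sits.
    match = re.search(br'charset=["\']?([A-Za-z0-9_\-]+)', data[:1024])
    if match:
        # The charset name itself is plain ASCII; normalise it to lower case.
        return match.group(1).decode('ascii').lower()
    # No declaration found: fall back to the usual good guess.
    return default
```

The result can then be passed to `data.decode(...)` in place of a hard-coded `'utf-8'`.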

Finally, convert the webpage to unicode text:

>>> text = data.decode('utf-8')
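
To get from the decoded text to a file, which is what the question asks for, one option is to open the file with an explicit encoding so the text is encoded on write. This is a minimal sketch assuming UTF-8 output; the function name `save_text` and the file name `page.html` are hypothetical, and `io.open` is used because it accepts an `encoding` argument on both Python 2 and Python 3.

```python
import io

def save_text(text, path, encoding='utf-8'):
    """Write unicode text to path, encoding it explicitly."""
    # io.open encodes on write, so the implicit ASCII codec is never involved.
    with io.open(path, 'w', encoding=encoding) as handle:
        handle.write(text)

# Example bytes standing in for response.read(); decode first, then save.
data = b'<p>caf\xc3\xa9 \xc2\xa9</p>'
save_text(data.decode('utf-8'), 'page.html')
```

This also suggests why `value.encode('utf-8')` fails in the question: in Python 2, calling `.encode()` on a byte string first decodes it implicitly as ASCII, which raises the `UnicodeDecodeError` on the first non-ASCII byte. Decoding the bytes explicitly and encoding only at write time avoids that step.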
