I’m trying to use urllib and urllib2 to read from a text file that has French characters in it, like “é”, “à”, and so on.
def load(url):
    from urllib2 import Request, urlopen, URLError, HTTPError
    req = Request(url)
    f = urlopen(req)
    f.readline()
    for line in f:
        line = line.split('\t')
        word = line[0].encode('utf-8')
I have a feeling that the read() method returns a byte string, so I use encode('utf-8') to get the Unicode value, but this gives me the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 6: ordinal not in range(128)
Can someone tell me what’s going on? Any help would be appreciated. Thanks!
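Yes, you’re reading bytes from the file. What you must do is decode, not encode, the byte string into Unicode. It’s already encoded, you see. If it weren’t, you wouldn’t need to do anything with it. That also explains why an encode call raises a UnicodeDecodeError: in Python 2, calling encode() on a byte string first decodes it implicitly with the ASCII codec, and that step fails on byte 0xe8.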
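Here is a minimal sketch of the difference. It uses a made-up sample line instead of your real file and assumes, purely for illustration, that the data is latin1-encoded. The sample bytes and variable names are mine, not from your code.

```python
# Hypothetical sample line, assumed to be latin1-encoded
# (0xe8 is "è" in latin1). Your real data comes from urlopen().
raw = b'premi\xe8re\tfirst\n'

# Decode the bytes into Unicode text first, then split on tabs.
text = raw.decode('latin-1')
word = text.split(u'\t')[0]

# Encode only when you need bytes again, e.g. to write UTF-8 output.
utf8_bytes = word.encode('utf-8')
```

In your loop that would mean replacing the encode call with a decode call that names the file's actual encoding.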
You have to specify the encoding used in the file. If it’s not utf8, another good suspect might be latin1. Or, you know, since it’s a Web document, you could fish the document’s encoding out of the headers and/or its content, but that’s a little beyond the scope of your question.
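If you want to go a bit further, here is a rough sketch of both ideas: pulling a charset out of a Content-Type header value, and falling back from utf8 to latin1 when nothing usable is declared. The helper names charset_from_content_type and decode_bytes are hypothetical, and the fallback order is only an assumption that suits French text; it is not a general-purpose encoding detector.

```python
def charset_from_content_type(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    # Parameters follow the media type, separated by semicolons.
    for part in content_type.split(';')[1:]:
        name, _, value = part.strip().partition('=')
        if name.strip().lower() == 'charset' and value:
            return value.strip().strip('"\'')
    return None


def decode_bytes(raw, declared=None):
    """Decode raw bytes: try the declared charset, then utf8, then latin1."""
    for enc in [e for e in (declared, 'utf-8') if e]:
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec name; try the next one
    # latin1 maps every byte to a character, so this last step cannot fail
    # (though it may produce the wrong characters for other encodings).
    return raw.decode('latin-1')
```

With urllib2 the header value should be available as f.info().getheader('Content-Type'), so something like decode_bytes(line, charset_from_content_type(header_value)) could stand in for the encode call once you have checked that the header is present.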