I parsed a file and saved its content in a database using Django. The website was 100% in English, so I naively assumed it would be ASCII all along, and saved the text happily as unicode.
You can guess the rest of the story 🙂
When I print, I get the usual encoding error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 48: ordinal not in range(128)
A quick search tells me that u'\u2019' is the UTF-8 representation of ’.
repr(string) gives me this:
"u'his son\\u2019s friend'"
Then of course I tried django.utils.encoding.smart_str and a more direct approach using string.encode('utf-8'), and I ended up with something printable. Unfortunately, it prints like this in my (Linux, UTF-8) terminal:
In [76]: repr(string.encode('utf-8'))
Out[76]: "'his son\\xe2\\x80\\x99s friend '"
In [77]: print string.encode('utf-8')
his son�s friend
Not what I expected. I suspect I double encoded something or missed an important point.
Of course, the file's original encoding is not published with the file. I guess I could read the HTTP headers or ask the webmaster, but since \u2019 looks like UTF-8, I assumed it was UTF-8. I could be very wrong; tell me if I am.
Solutions are obviously appreciated, but a deep explanation of the cause, and of what to do to keep this from happening again, would be appreciated even more. I often get bitten by encoding issues, which shows that I still haven't completely mastered the subject.
You are fine. You have the proper data. Yes, the original data is UTF-8 (based on context, U+2019 makes perfect sense as an apostrophe between “son” and “s”, and \xe2\x80\x99 is its UTF-8 encoding). The weird � error character probably just means your terminal configuration’s font doesn’t have a glyph for this character (fancy apostrophe), or that the terminal is not actually decoding the output as UTF-8. No big deal. The data will be correct where it counts. If you are nervous, try some different terminal/OS combinations (I’m on OS X using iTerm). I spent a lot of time explaining to my QA guys that the scary � question mark character just means they don’t have a Chinese font installed on their Windows box (in my case we were testing with Chinese data). Here are some references.
See also: http://www.cl.cam.ac.uk/~mgk25/ucs/quotes.html
See also for character 2019 (e28099 in hex, search for “2019” on this page): http://www.utf8-chartable.de/unicode-utf8-table.pl?start=8000
See also: http://www.joelonsoftware.com/articles/Unicode.html
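If you want to convince yourself that the data is intact regardless of what the terminal draws, here is a minimal sketch. The variable names are mine, and the literal simply stands in for the string you loaded from the database; it should run on Python 2.6+ as well as Python 3.3+. It checks that U+2019 encodes to the three bytes e2 80 99, that decoding gives back the original text, and that the ASCII codec is what raises the error from the question.

```python
# -*- coding: utf-8 -*-
# Stand-in for the unicode string read from the database (assumption).
text = u'his son\u2019s friend'

# Encoding to UTF-8 turns the single code point U+2019 into three bytes.
encoded = text.encode('utf-8')
apostrophe_bytes = u'\u2019'.encode('utf-8')

# Decoding with the same codec must return exactly the original text.
round_tripped = encoded.decode('utf-8')

# The ASCII codec cannot represent U+2019, which is the error in the question.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

If the round trip holds, nothing was double encoded, and whatever is left is a display issue on the terminal side.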