I have a function like this:
def convert_to_unicode(data):
    row = {}
    if data == None:
        return data
    try:
        for key, val in data.items():
            if isinstance(val, str):
                row[key] = unicode(val.decode('utf8'))
            else:
                row[key] = val
        return row
    except Exception, ex:
        log.debug(ex)
to which I feed a result set (obtained using MySQLdb.cursors.DictCursor) row by row, to transform all the string values to unicode (for example, {'column_1':'XXX'} becomes {'column_1':u'XXX'}).
The problem is that one of the rows has a value like {'column_1':'Gabriel García Márquez'}
and it does not get transformed. It throws this error:
'utf8' codec can't decode byte 0xed in position 12: invalid continuation byte
When I searched for this, it seemed to have something to do with ASCII encoding.
The solutions I tried are:
- adding # -*- coding: utf-8 -*- at the beginning of my file … does not help
- changing the line row[key] = unicode(val.decode('utf8')) to row[key] = unicode(val.decode('utf8', 'ignore')) … as expected, it ignores the non-ASCII characters and returns {'column_1':u'Gabriel Garca Mrquez'}
- changing the line row[key] = unicode(val.decode('utf8')) to row[key] = unicode(val.decode('latin-1')) … does the job, but I am afraid it will support only Western European characters (as per here)
Can anybody point me in the right direction, please?
Firstly:
The data you’re getting in your result set is clearly latin-1 encoded, or you wouldn’t be observing this behavior. It is entirely correct that trying to decode a latin-1-encoded byte string as though it were utf-8-encoded blows up in your face. Once you have a latin-1-encoded byte string foo, if you want to convert it to the unicode type, foo.decode('latin1') is the right thing to do.
I noticed the expression unicode(val.decode('utf8')) in your code. This is equivalent to just val.decode('utf8'); calling the .decode method of a byte string converts it to unicode, so you’re calling unicode() on a unicode string, which just returns the unicode string.
Secondly:
The underlying problem, that you are being handed data in latin-1 encoding, is not with Python’s string types, per se, so much as it is with the MySQLdb library. I don’t know this problem in intimate detail, but as I understand it, in ancient versions of MySQL the default encoding used by MySQL databases was latin-1, but now it is utf-8 (and has been for many years). The MySQLdb library, however, still by default establishes latin-1-encoded connections with the database. There are literally dozens of StackOverflow questions relating to MySQL, Python, and string encoding, and while I don’t fully understand them, one easy-to-use solution that seems to work for people is this one: http://www.dasprids.de/blog/2007/12/17/python-mysqldb-and-utf-8
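For reference, here is a rough sketch of what I understand the usual fix to be: ask MySQLdb for a UTF-8 connection and for unicode objects when you connect. The charset and use_unicode arguments are real MySQLdb.connect() parameters, but the two helper names and the credentials are placeholders I made up, and I have not run this against a real server.

```python
def utf8_connect_kwargs(host, user, passwd, db):
    """Build keyword arguments for MySQLdb.connect() that request UTF-8.

    Hypothetical helper: the name and the credential values are placeholders.
    """
    return {
        'host': host,
        'user': user,
        'passwd': passwd,
        'db': db,
        'charset': 'utf8',     # character set used for the connection
        'use_unicode': True,   # hand back text columns as unicode objects
    }


def connect_utf8(host, user, passwd, db):
    """Open a DictCursor connection using the settings above (untested sketch)."""
    # Imported here so the helper above can be inspected without the driver installed.
    import MySQLdb
    import MySQLdb.cursors
    return MySQLdb.connect(cursorclass=MySQLdb.cursors.DictCursor,
                           **utf8_connect_kwargs(host, user, passwd, db))
```

If this works as described, the rows should already contain unicode values and the convert_to_unicode step would no longer be needed, assuming the data stored in the tables is itself correctly encoded.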
I wish I could give you a more comprehensive and confident answer on the MySQLdb issue, but I’ve never even used MySQL and I don’t want to risk posting anything untrue. Perhaps someone can come along and provide more detail. Nonetheless, I hope this helps you.
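To make the first point concrete, here is a small sketch using the bytes from your error message. The variable names are mine, and the b'...' literal only stands in for what the latin-1 connection hands back; it should behave the same on Python 2.6+ and Python 3.

```python
# -*- coding: utf-8 -*-
# The name as a latin-1 connection would deliver it: one byte per accented letter.
raw = b'Gabriel Garc\xeda M\xe1rquez'

# Decoding with the codec the bytes were actually written in works.
text = raw.decode('latin-1')

# Decoding the same bytes as UTF-8 fails at the first accented letter:
# 0xed announces a multi-byte sequence, but the next byte ('a') does not continue it.
try:
    raw.decode('utf8')
    utf8_error = None
except UnicodeDecodeError as ex:
    utf8_error = ex

# The same text encoded as UTF-8 uses two bytes for each accented letter.
utf8_bytes = text.encode('utf8')
```

Here utf8_error.start is the offset of the offending byte, which matches the "position 12" in your traceback. Note also that raw.decode(...) already returns a unicode object on Python 2, so wrapping it in unicode() adds nothing.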