I have a function like this: def convert_to_unicode(data): row = {} if data ==

Question

Asked: June 16, 2026 (2026-06-16T06:55:40+00:00)

I have a function like this:

def convert_to_unicode(data):
    row = {}
    if data == None:
        return data
    try:
        for key, val in data.items():
            if isinstance(val, str):
                row[key] = unicode(val.decode('utf8'))
            else:
                row[key] = val
        return row
    except Exception, ex:
        log.debug(ex)

I feed it a result set (obtained with MySQLdb.cursors.DictCursor) row by row, to transform all the string values to unicode. For example, {'column_1':'XXX'} becomes {'column_1':u'XXX'}.

The problem is that one of the rows has a value like {'column_1':'Gabriel García Márquez'}
and it does not get transformed. It throws this error:

'utf8' codec can't decode byte 0xed in position 12: invalid continuation byte

When I searched for this error, it seemed to have something to do with ASCII encoding.

The solutions I tried are:

Adding # -*- coding: utf-8 -*- at the beginning of my file. This does not help.
Changing the line row[key] = unicode(val.decode('utf8')) to row[key] = unicode(val.decode('utf8', 'ignore')). As expected, this drops the non-ASCII characters and returns {'column_1':u'Gabriel Garca Mrquez'}.
Changing the line row[key] = unicode(val.decode('utf8')) to row[key] = unicode(val.decode('latin-1')). This does the job, but I am afraid it will support only Western European characters (as per Here).

Can anybody point me in the right direction, please?

1 Answer

Editorial Team · Answer 1 · 2026-06-16T06:55:41+00:00

Firstly:

The data you’re getting in your result set is clearly latin-1 encoded, or you wouldn’t be observing this behavior. It is entirely correct that trying to decode a latin-1-encoded byte string as though it were utf-8-encoded blows up in your face. Once you have a latin-1-encoded byte string foo, if you want to convert it to the unicode type, foo.decode('latin1') is the right thing to do.
I noticed the expression unicode(val.decode('utf8')) in your code. This is equivalent to just val.decode('utf8'); calling the .decode method of a byte string converts it to unicode, so you’re calling unicode() on a unicode string, which just returns the unicode string.
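
To make the difference concrete, here is a minimal sketch that reproduces the three behaviours described in the question. It assumes the row value arrives as the latin-1 bytes written out below (the variable names are only illustrative), and it uses nothing beyond the built-in .decode method, so it should behave the same on Python 2.6+ and Python 3.

```python
# The name as a latin-1 byte string: in latin-1, 0xed is 'i' with an
# acute accent and 0xe1 is 'a' with an acute accent.
raw = b'Gabriel Garc\xeda M\xe1rquez'

# Decoding as UTF-8 fails: 0xed announces a multi-byte sequence, but the
# next byte ('a') is not a valid continuation byte.
try:
    raw.decode('utf8')
    error_position = None
except UnicodeDecodeError as exc:
    error_position = exc.start  # index of the offending byte

# 'ignore' silently drops the bytes it cannot decode, losing the accents.
ignored = raw.decode('utf8', 'ignore')

# Decoding with the encoding the bytes were actually written in works.
text = raw.decode('latin1')

print(error_position)
print(repr(ignored))
print(repr(text))
```

The byte offset reported by the exception matches the "position 12" in the error message from the question, which is where the first accented character sits.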

Secondly:

Your real problem here, if you want to be able to deal with characters outside the character set supported by the latin-1 encoding, is not so much with Python's string types as with the MySQLdb library. I don't know this problem in intimate detail, but as I understand it, older versions of MySQL used latin-1 as the default encoding for databases, while newer versions default to a UTF-8 character set. The MySQLdb library, however, still establishes latin-1-encoded connections with the database by default. There are dozens of StackOverflow questions relating to MySQL, Python, and string encoding, and while I don't fully understand them, one easy-to-use solution that seems to work for people is this one:
http://www.dasprids.de/blog/2007/12/17/python-mysqldb-and-utf-8
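
A commonly suggested way to do this is to ask MySQLdb for a UTF-8 connection through the charset and use_unicode arguments of MySQLdb.connect, so that string columns come back already decoded. The following is only a sketch under that assumption: the helper names and the credentials are hypothetical, and I have not verified it against your database.

```python
def utf8_connect_kwargs(host, user, passwd, db):
    """Build keyword arguments for MySQLdb.connect (hypothetical helper).

    charset='utf8' asks for a UTF-8 connection; use_unicode=True asks
    MySQLdb to return string columns as unicode objects.
    """
    return dict(host=host, user=user, passwd=passwd, db=db,
                charset='utf8', use_unicode=True)


def connect_utf8(host, user, passwd, db):
    """Open a connection with the settings above (hypothetical helper)."""
    import MySQLdb  # imported here so the sketch loads without the driver
    import MySQLdb.cursors
    kwargs = utf8_connect_kwargs(host, user, passwd, db)
    # DictCursor matches the cursor class mentioned in the question.
    return MySQLdb.connect(cursorclass=MySQLdb.cursors.DictCursor, **kwargs)


# Placeholder credentials, for illustration only.
settings = utf8_connect_kwargs('localhost', 'someuser', 'secret', 'somedb')
print(settings['charset'], settings['use_unicode'])
```

If this works for your setup, the convert_to_unicode step should no longer be needed, since the values would already be unicode when they reach your code.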

I wish I could give you a more comprehensive and confident answer on the MySQLdb issue, but I’ve never even used MySQL and I don’t want to risk posting anything untrue. Perhaps someone can come along and provide more detail. Nonetheless, I hope this helps you.
