I want my function to take an argument that could be a unicode object

Asked: May 17, 2026

I want my function to take an argument that could be either a unicode object or a UTF-8 encoded string. Inside the function, I want to convert the argument to unicode. I have something like this:

def myfunction(text):
    if not isinstance(text, unicode):
        text = unicode(text, 'utf-8')

    ...

Is it possible to avoid the use of isinstance? I was looking for something more duck-typing friendly.

During my experiments with decoding, I have run into several weird behaviours of Python. For instance:

>>> u'hello'.decode('utf-8')
u'hello'
>>> u'cer\xf3n'.decode('utf-8')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 3: ordinal not in range(128)

Or

>>> u'hello'.decode('utf-8')
u'hello'
>>> unicode(u'hello', 'utf-8')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
TypeError: decoding Unicode is not supported

By the way, I’m using Python 2.6.

1 Answer

Editorial Team · Answer 1 · 2026-05-17T06:32:45+00:00

You could just try decoding it with the 'utf-8' codec, and if that raises a TypeError (because the argument is already a unicode object), return the object unchanged.

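def myfunction(text):
    try:
        # Byte strings are decoded; unicode objects raise TypeError.
        text = unicode(text, 'utf-8')
    except TypeError:
        pass
    # Return the text in both cases, not only when decoding was skipped.
    return text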

print(myfunction(u'cer\xf3n'))
# cerón

When you take a unicode object and call its decode method with the 'utf-8' codec, Python 2 first tries to convert the unicode object to a string (str) object, and then it calls that string object's decode('utf-8') method.

This implicit conversion fails whenever the unicode object contains non-ASCII characters, because Python 2 uses the ascii codec by default. That is why the error is a UnicodeEncodeError even though you asked for a decode.
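
The hidden step can be spelled out explicitly. This is only an illustrative sketch, not the interpreter's actual implementation: the helper name decode_like_python2 is made up here, and it assumes the default encoding is ASCII, as on a stock Python 2 install.

```python
def decode_like_python2(text, encoding='utf-8'):
    """Mimic what Python 2 does for unicode_obj.decode(encoding).

    Hypothetical helper for illustration only: the text is first encoded
    to bytes with the ASCII codec (assumed default), then decoded.
    """
    raw = text.encode('ascii')   # the implicit step; fails on non-ASCII text
    return raw.decode(encoding)  # the decode that was actually requested


# Pure-ASCII text survives the round trip.
ascii_result = decode_like_python2(u'hello')

# Non-ASCII text fails in the hidden encode step, not in the decode.
try:
    decode_like_python2(u'cer\xf3n')
    failure = None
except UnicodeEncodeError as exc:
    failure = exc
```

Here failure ends up holding the same kind of UnicodeEncodeError that the traceback in the question shows.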

So, in general, never try to decode unicode objects. Or, if you must try, wrap the call in a try/except block. There may be a few codecs for which decoding unicode objects works in Python 2 (see below), but they have been removed in Python 3.
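
If the same function also has to run on Python 3, the try/except idea still carries over, since str(text, 'utf-8') raises TypeError for text that is already decoded. The following is a rough sketch under that assumption; the names text_type and to_text are hypothetical and not part of the original answer.

```python
try:
    text_type = unicode  # Python 2: the text type is called unicode
except NameError:
    text_type = str      # Python 3: str is already the text type


def to_text(value, encoding='utf-8'):
    """Decode byte strings; return already-decoded text unchanged."""
    try:
        return text_type(value, encoding)
    except TypeError:
        # Raised when value is already text (decoding it is not supported).
        return value


decoded = to_text(b'cer\xc3\xb3n')   # UTF-8 bytes are decoded to text
unchanged = to_text(u'cer\xf3n')     # text passes through untouched
```

Both calls should give the same text value, so callers do not need to know which form they passed in.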

See this Python bug ticket for an interesting discussion of the issue, and also Guido van Rossum’s blog:

“We are adopting a slightly different approach to codecs: while in Python 2, codecs can accept either Unicode or 8-bits as input and produce either as output, in Py3k, encoding is always a translation from a Unicode (text) string to an array of bytes, and decoding always goes the opposite direction. This means that we had to drop a few codecs that don’t fit in this model, for example rot13, base64 and bz2 (those conversions are still supported, just not through the encode/decode API).”
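
As a small illustration of "still supported, just not through the encode/decode API", the conversions named in the quote can be reached through standard library modules instead of the str/bytes methods. This sketch assumes a reasonably recent Python 3; the variable names are only for illustration.

```python
import base64
import bz2
import codecs

# rot13 is a text-to-text transform, reached through the codecs module
# rather than through a method on the string itself.
rot13_text = codecs.encode(u'hello', 'rot_13')

# base64 and bz2 are bytes-to-bytes conversions with dedicated modules.
b64_bytes = base64.b64encode(b'hello')
roundtrip = bz2.decompress(bz2.compress(b'hello'))
```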
