If I have an object like:
d = {'a':1, 'en': 'hello'}
…then I can pass it to urllib.urlencode, no problem:
from urllib import urlencode
percent_escaped = urlencode(d)
print percent_escaped
But if I try to pass an object with a value of type unicode, game over:
d2 = {'a':1, 'en': 'hello', 'pt': u'olá'}
percent_escaped = urlencode(d2)
print percent_escaped # This fails with a UnicodeEncodeError
So my question is about a reliable way to prepare an object to be passed to urlencode.
I came up with this function where I simply iterate through the object and encode values of type string or unicode:
def encode_object(object):
    for k,v in object.items():
        if type(v) in (str, unicode):
            object[k] = v.encode('utf-8')
    return object
This seems to work:
d2 = {'a':1, 'en': 'hello', 'pt': u'olá'}
percent_escaped = urlencode(encode_object(d2))
print percent_escaped
And that outputs a=1&en=hello&pt=ol%C3%A1, ready for passing to a POST call or whatever.
But my encode_object function just looks really shaky to me. For one thing, it doesn’t handle nested objects.
For another, I’m nervous about that if statement. Are there any other types that I should be taking into account?
And is comparing the type() of something to the native object like this good practice?
type(v) in (str, unicode) # not so sure about this...
Thanks!
You should indeed be nervous. The whole idea that you might have a mixture of bytes and text in some data structure is horrifying. It violates the fundamental principle of working with string data: decode at input time, work exclusively in unicode, encode at output time.
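As a rough illustration of that principle, here is a minimal sketch. The function name `greet` and the assumption that the incoming bytes are UTF-8 are mine, purely for illustration; the code is written to run on both Python 2 and Python 3.

```python
def greet(raw_name):
    """raw_name is assumed to be a UTF-8 encoded byte string from outside."""
    name = raw_name.decode("utf-8")        # decode at input time
    message = u"ol\xe1, " + name.upper()   # work exclusively in unicode text
    return message.encode("utf-8")         # encode at output time

# b"jo\xc3\xa3o" is the UTF-8 encoding of u"jo\xe3o"
result = greet(b"jo\xc3\xa3o")
```

Bytes and text never meet in the middle of the function, so no implicit ASCII conversion can be triggered.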
Update in response to comment:
You are about to output some sort of HTTP request. This needs to be prepared as a byte string. The fact that urllib.urlencode is not capable of properly preparing that byte string if there are unicode characters with ordinal >= 128 in your dict is indeed unfortunate. If you have a mixture of byte strings and unicode strings in your dict, you need to be careful. Let’s examine just what urlencode() does:
The last two tests demonstrate the problem with urlencode(). Now let’s look at the str tests.
If you insist on having a mixture, then you should at the very least ensure that the str objects are encoded in UTF-8.
'\x80' is suspicious: it is not the result of any_valid_unicode_string.encode('utf8').
'\xe2\x82\xac' is OK; it's the result of u'\u20ac'.encode('utf8').
'1' is OK: all ASCII characters are OK on input to urlencode(), which will percent-encode characters such as '%' if necessary.
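One way to check these three cases yourself is to try decoding each byte string as UTF-8. This is only a sketch; the helper name `is_valid_utf8` is hypothetical, and the `b` prefix is used so that the same code runs on Python 2 and Python 3.

```python
def is_valid_utf8(data):
    """Return True if the byte string `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

for candidate in (b"\x80", b"\xe2\x82\xac", b"1"):
    print("%r -> %s" % (candidate, is_valid_utf8(candidate)))
```

A lone 0x80 byte is a UTF-8 continuation byte with nothing to continue, which is why the first check fails.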
Here’s a suggested converter function. Unlike yours, it does not mutate the input dict and then return it; it returns a new dict. It forces an exception if a value is a str object but is not a valid UTF-8 string. By the way, your concern about it not handling nested objects is a little misdirected: your code works only with dicts, and the concept of nested dicts doesn’t really fly for URL-encoded key/value pairs.
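A minimal sketch of a converter matching that description follows. The name `encoded_dict`, the `text_type` alias, and the Python 2/3 compatible import are assumptions made for illustration, not necessarily what was originally posted.

```python
try:
    from urllib import urlencode          # Python 2
except ImportError:
    from urllib.parse import urlencode    # Python 3

text_type = type(u"")  # `unicode` on Python 2, `str` on Python 3

def encoded_dict(in_dict):
    """Return a new dict whose text values are UTF-8 encoded byte strings."""
    out_dict = {}
    for key, value in in_dict.items():
        if isinstance(value, text_type):
            value = value.encode("utf-8")
        elif isinstance(value, bytes):
            # Validation only: raises UnicodeDecodeError if not valid UTF-8.
            value.decode("utf-8")
        out_dict[key] = value   # other types (e.g. int) pass through unchanged
    return out_dict

# u"ol\xe1" is u'olá', written with an escape to keep the source ASCII-only
query = urlencode(encoded_dict({"pt": u"ol\xe1"}))
```

Using isinstance() rather than comparing type() also accepts subclasses of the string types, and building `out_dict` leaves the caller's dict untouched.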
and here’s the output, using the same tests in reverse order (because the nasty one is at the front this time):
Does that help?