I’m a German developer writing web applications for Germans, which means I cannot by any means rely on plain ASCII encoding. At least characters like ä, ö, ü, ß have to be supported.
Fortunately, Django treats ByteStrings as utf-8 encoded by default (as described in the docs). So it should just work, if I add the # -*- coding: utf-8 -*- line to the beginning of each .py file and set the editor encoding, shouldn’t it? Well, it does most of the time…
But I seem to be missing something when it comes to URLs. Or maybe it has nothing to do with URLs, but so far I haven’t noticed any other encoding misbehavior. There are two cases I can remember as examples:
The URL pattern url(r'^([a-z0-9äöüß_\-]+)/$', views.view_page) doesn’t recognize URLs containing ä, ö, ü, ß at all. Those characters are simply ignored.
The following view function raises an exception:
def do_redirect(request, id):
    return redirect('/page/{0}'.format(id))
Here the id argument is captured from the URL, as in the first example. If I fix the URL pattern (by specifying it as a unicode string) and then access /ä/, I get the exception
UnicodeEncodeError at /ä/
'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
However, trying the following code for the view function:
def do_redirect(request, id):
    return redirect('/page/' + id)
everything works out fine. That makes me believe the actual problem lies not within Django but in Python, which treats byte strings as ASCII. I’m not that deep into encodings, but the problem in the second example is obviously the format() method of the string object. So the first example might fail because of the way Python handles regular expressions (though I don’t know whether Django uses the re module or something else).
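A minimal sketch of what seems to happen in the two view functions above, assuming Python 2 semantics: calling format() on a byte string has to produce a byte string, so the unicode argument is implicitly encoded with the ASCII codec, whereas + promotes the ASCII-only byte string to unicode. The sketch makes both implicit steps explicit so that it also runs on Python 3; the names captured and encode_like_bytestring_format are hypothetical.

```python
# -*- coding: utf-8 -*-

# What Django hands to the view for the URL /ä/ (a unicode string).
captured = u'\xe4'


def encode_like_bytestring_format(value):
    # Assumption: this mirrors the implicit step in Python 2's
    # '/page/{0}'.format(value) when the template is a byte string.
    return value.encode('ascii')


try:
    encode_like_bytestring_format(captured)
except UnicodeEncodeError as exc:
    error = exc  # same exception type as in the traceback above
else:
    error = None

# Concatenation goes the other way: the byte string is decoded as ASCII,
# which succeeds because '/page/' contains only ASCII characters.
concatenated = b'/page/'.decode('ascii') + captured

# With a unicode template, no implicit encoding is needed at all.
formatted = u'/page/{0}'.format(captured)
```

Under these assumptions, error holds a UnicodeEncodeError, while concatenated and formatted are the same unicode string.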
My workaround so far is to prefix the string with u whenever such an error occurs. That’s a bad solution, since I might easily overlook something. I tried marking every Python string as unicode, but that causes other exceptions and is quite ugly.
Does anyone know exactly what the problem is and how to solve it in a pleasant way (i.e. a way that doesn’t make your head explode when the code grows bigger)?
Thanks in advance!
EDIT: For my regular expression, I found out why the u is needed. Specifying a string as a raw string (r) makes it be interpreted as ASCII. Leaving out the r makes the regex work without the u but introduces some headaches with backslashes.
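A rough sketch of why the pattern may fail without the u prefix, assuming a UTF-8 encoded source file under Python 2: in a byte-string pattern, each umlaut is stored as two bytes, so the character class contains those individual bytes rather than the character itself. To stay runnable on Python 3 as well, the byte-wise reading is only simulated here by re-decoding the UTF-8 bytes as Latin-1; the hyphen is moved to the end of the class so no backslash is needed. All names are hypothetical.

```python
# -*- coding: utf-8 -*-
import re

PATTERN_TEXT = u'^([a-z0-9äöüß_-]+)/$'

# Unicode pattern: 'ä' is a single character inside the class.
unicode_pattern = re.compile(PATTERN_TEXT, re.UNICODE)

# Simulated byte-string pattern: every UTF-8 byte becomes its own
# character in the class (an approximation of the Python 2 behaviour).
bytewise_pattern = re.compile(PATTERN_TEXT.encode('utf-8').decode('latin-1'))

path = u'ä/'  # the unicode path the pattern is matched against

unicode_match = unicode_pattern.match(path)
bytewise_match = bytewise_pattern.match(path)
```

In this sketch unicode_match captures the umlaut, while bytewise_match is None because the class only knows the two separate UTF-8 bytes.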
Prefixing your strings with u is the solution. If that is a problem for you, it looks like a symptom of a more general problem: you have a lot of magic constants in your code. That is bad (and you already see why). Try to avoid them; for example, you can use a named URL pattern or a view name for redirecting instead of retyping part of the URL.
If you can’t avoid them, turn them into named constants, and place their assignments in one place. Then, you’ll see that all of them are prefixed properly, and it will be difficult to overlook it.
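A minimal sketch of the "named constants in one place" idea, assuming a single module that holds every URL-related string. The names PAGE_SLUG_REGEX, PAGE_URL_TEMPLATE and page_url are hypothetical and not part of Django; with all literals side by side, a missing u prefix is easy to spot.

```python
# -*- coding: utf-8 -*-
import re

# All URL-related literals live here, each one a unicode string.
# The hyphen sits at the end of the class, so no backslash is needed.
PAGE_SLUG_REGEX = u'^([a-z0-9äöüß_-]+)/$'
PAGE_URL_TEMPLATE = u'/page/{0}'


def page_url(page_id):
    """Build the redirect target from the shared unicode template."""
    return PAGE_URL_TEMPLATE.format(page_id)


# Usage: views and URL configuration import the names instead of
# retyping the literals.
example_url = page_url(u'ä')
example_match = re.match(PAGE_SLUG_REGEX, u'größe/', re.UNICODE)
```

In a real project, the named URL pattern mentioned above would probably replace PAGE_URL_TEMPLATE entirely, since the URL is then derived from the pattern rather than typed a second time.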