I created a file containing a dictionary with data written in Spanish (e.g. Damián):
fileNameX.write(json.dumps(dictionaryX, indent=4))
The data come from some FQL fetching operations, e.g.:
select name from user where uid in XXX
When I open the file, I find that, for instance, “Damián” looks like “Dami\u00e1n”.
I’ve tried some options:
- ensure_ascii=False:
fileNameX.write(json.dumps(dictionaryX, indent=4, ensure_ascii=False))
But I get an error (UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position XXX: ordinal not in range(128)).
- encode(encoding='latin-1'):
dictionaryX.append({ 'name': unicodeVar.encode(encoding='latin-1'), ... })
But I get another error (UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position XXX: invalid continuation byte).
To sum up, I’ve tried several possibilities, but have less than a clue. I’m lost. Please, I need help. Thanks!
You have many options, and have stumbled upon something rather complicated that depends on your Python version and which you absolutely must understand fully in order to write correct code. Generally the approach taken in 3.x is stricter and a bit harder to work with, but it is much less likely that you will make a mistake or get yourself into a complicated situation. (Based on the exact symptoms you report, you seem to be using 2.x.)
json.dumps has different behaviour in 2.x and 3.x. In 2.x, it produces a str, which is a byte-string (unknown encoding). In 3.x, it still produces a str, but now str in 3.x is a proper Unicode string.

JSON is inherently a Unicode-supporting format, but it expects files to be in UTF-8 encoding. However, please understand that JSON supports \u style escapes in strings. When you read in this data, you will get the correct string back. The reading code produces Unicode strings (unicode in 2.x, str in 3.x) when it reads strings out of the JSON.

As for the output you see: á cannot be represented in ASCII, so it gets encoded as \u00e1 by default, to avoid the other problems you had. This happens even in 3.x.

ensure_ascii=False:
This disables the previous encoding. In 2.x, it means you get a unicode object instead: a real Unicode object, preserving the original á character. In 3.x, it means that the character is not explicitly translated. But either way, ensure_ascii=False means that json.dumps will give you a Unicode string.

Unicode strings must be encoded to be written to a file. There is no such thing as "unicode data"; Unicode is an abstraction. In 2.x, this encoding is implicitly 'ascii' when you feed a Unicode object to file.write; it was expecting a str. That implicit ASCII encoding is what raises your UnicodeEncodeError. To get around this, you can use the codecs module, or explicitly encode as 'utf-8' before writing. In 3.x, the encoding is set with the encoding keyword argument when you open the file (the default depends on the platform and is again probably not what you want).

encode(encoding='latin-1'):
Here, you are encoding before producing the dictionary, so that you have a str object in your data. Now a problem occurs because when there are str objects in your data, the JSON encoder assumes, by default, that they hold text encoded as UTF-8. The byte 0xe1 is á in Latin-1 but is not valid UTF-8 on its own, hence your UnicodeDecodeError. This can be changed, in 2.x, using the encoding keyword argument to json.dumps. (In 3.x, the encoder will simply refuse to serialize bytes objects, i.e. non-Unicode strings!)

However, if your goal is simply to get the data into the file directly, then json.dumps is the wrong tool for you. Have you wondered what that s in the name is for? It stands for "string"; this is the special case. The ordinary case, in fact, is writing directly to a file (instead of giving you a string and expecting you to write it yourself), which is what json.dump (no 's') does. Again, the JSON standard expects UTF-8 encoding, and again 2.x has an encoding keyword parameter that defaults to UTF-8 (you should leave this alone).
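
To make the points above concrete, here is a minimal sketch for Python 3 only (the 2.x behaviour described above differs and is not shown). The people list and the temporary file path are hypothetical stand-ins for the data and file in the question; it only illustrates the default escaping, ensure_ascii=False, writing with json.dump to a file opened as UTF-8, and the refusal to serialize bytes.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the data fetched in the question.
people = [{"name": "Damián"}]

# Default: non-ASCII characters become \uXXXX escapes, which is still valid JSON.
escaped = json.dumps(people)

# ensure_ascii=False keeps the character itself in the resulting str.
literal = json.dumps(people, ensure_ascii=False)

# Both forms decode back to the same data.
same_after_loading = json.loads(escaped) == json.loads(literal) == people

# json.dump writes straight to a file; the encoding is chosen when opening it.
path = os.path.join(tempfile.mkdtemp(), "people.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(people, f, indent=4, ensure_ascii=False)

# Read the raw bytes back to check what actually landed on disk.
with open(path, "rb") as f:
    raw_bytes = f.read()

# bytes are not text: the 3.x encoder refuses to serialize them.
try:
    json.dumps({"name": "Damián".encode("latin-1")})
    bytes_rejected = False
except TypeError:
    bytes_rejected = True
```

Opening the file with an explicit encoding="utf-8" is the key design choice here: it keeps the result independent of the platform default while still letting ensure_ascii=False put the readable á into the file.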