I have a simple program that loads a .json file which contains a funny character. The program (see below) runs fine in Terminal but gets this error in IntelliJ:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2: ordinal not in range(128)
The crucial code is:
with open(jsonFileName) as f:
    jsonData = json.load(f)
If I replace the open() call with:
with open(jsonFileName, encoding='utf-8') as f:
Then it works in both IntelliJ and Terminal. I’m still new to Python and the IntelliJ plugin, and I don’t understand why they’re different. I thought sys.path might be different, but the output makes me think that’s not the cause. Could someone please explain? Thanks!
Versions:
- OS: Mac OS X 10.7.4 (also tested on 10.6.8)
- Python 3.2.3 (v3.2.3:3d0686d90f55, Apr 10 2012, 11:25:50) /Library/Frameworks/Python.framework/Versions/3.2/bin/python3.2
- IntelliJ: 11.1.3 Ultimate
Files (2):
1. unicode-error-demo.py
#!/usr/bin/python
import json
from pprint import pprint as pp
import sys

def main():
    # Compare by value; "is not" tests object identity, not equality.
    if len(sys.argv) != 2:
        print(sys.argv[0], "takes one arg: a .json file")
        return
    jsonFileName = sys.argv[1]
    print("sys.path:")
    pp(sys.path)
    print("processing", jsonFileName)
    # with open(jsonFileName) as f: # OK in Terminal, but BUG in IntelliJ: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2: ordinal not in range(128)
    with open(jsonFileName, encoding='utf-8') as f: # OK in both
        jsonData = json.load(f)
    pp(jsonData)

if __name__ == "__main__":
    main()
2. encode-temp.json
["™"]
The JSON .load() function expects Unicode data, not raw bytes. Because the file is opened in text mode, Python automatically tries to decode its bytes to a Unicode string for you using a default codec (in your case ASCII), and fails. By opening the file with the UTF-8 codec, you make that conversion explicit. See the open() function, whose documentation notes that the default encoding is platform dependent.
The encoding that would be used is determined as follows:
- Try os.device_encoding() to see if there is a terminal encoding.
- Use the locale.getpreferredencoding() function, which depends on the environment you run your code in. The do_setlocale argument of that function is set to False.
- Use 'ASCII' as a default if both methods have returned None.
This is all done in C, but the logic amounts to trying each of these in turn and keeping the first result that is not None.
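As a rough Python sketch of that lookup order (the real logic lives in C; the helper name guess_default_encoding and its fd parameter are hypothetical, chosen only for illustration), it might look like this:

```python
import locale
import os


def guess_default_encoding(fd):
    """Approximate how an encoding is chosen when open() is given none.

    Hypothetical helper for illustration only; not part of any library.
    """
    # 1. Encoding of the terminal attached to this file descriptor, if any.
    #    os.device_encoding() returns None when fd is not a terminal.
    encoding = os.device_encoding(fd)
    if encoding is None:
        # 2. The locale's preferred encoding, without calling setlocale().
        encoding = locale.getpreferredencoding(False)
    if not encoding:
        # 3. Last resort when neither source produced an answer.
        encoding = 'ascii'
    return encoding
```

Calling it with the file descriptor of an open file, for example guess_default_encoding(f.fileno()), in both Terminal and IntelliJ should show whether the two environments disagree.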
So when you run your program in a terminal, os.device_encoding() returns 'UTF-8', but when running under IntelliJ there is no terminal, and if no locale is set either, Python uses 'ASCII'.
The Python Unicode HOWTO tells you all about the difference between Unicode strings and byte strings, as well as encodings. Another essential article on the subject is Joel Spolsky's Absolute Minimum Unicode knowledge article.
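To see the failure without involving a file at all, here is a minimal sketch. It assumes encode-temp.json is saved as UTF-8, which is consistent with the 0xe2 byte at position 2 in the error message; try_decode is a hypothetical helper name.

```python
import json
import locale

# UTF-8 bytes of the question's encode-temp.json: ["™"]
# (assumes the file is saved as UTF-8; U+2122 encodes as e2 84 a2)
raw = b'["\xe2\x84\xa2"]'


def try_decode(data, encoding):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return str(exc)


ascii_result = try_decode(raw, 'ascii')  # fails on byte 0xe2 at index 2
utf8_result = try_decode(raw, 'utf-8')   # decodes cleanly

print("ascii:", ascii_result)
# ascii() keeps the output printable even if stdout itself is ASCII-only.
print("utf-8:", ascii(json.loads(utf8_result)))
# Run this in both Terminal and IntelliJ to compare the two environments.
print("preferred encoding here:", locale.getpreferredencoding(False))
```

The first two characters, [ and ", are plain ASCII, so the decoder only trips on the third byte, which is why the error reports position 2. Passing encoding='utf-8' to open() sidesteps the environment entirely.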