I have a Python script that’s running periodically on Heroku using their Scheduler add-on. It prints some debug info, but when there’s a non-ASCII character in the text, I get an error in the logs like:
SyntaxError: Non-ASCII character '\xc2' in file send-tweet.py on line 40, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
That’s when I have a line like this in the script:
print u"Unicode test: £ ’ …"
I’m not sure what to do about this. If I have this in the script:
import locale
print u"Encoding: %s" % locale.getdefaultlocale()[1]
then this is output in the logs:
Encoding: UTF-8
So, why is it trying, and failing, to output other text in ASCII?
UPDATE: FWIW, here’s the actual script I’m using. The debugging output is on lines 38-39.
As the error says:
Non-ASCII character '\xc2' in file send-tweet.py on line 40, but no encoding declared
i.e. there is no encoding declared in your Python source file.
The linked PEP tells you how to declare an encoding in your Python source. The declared encoding should match the one your editor/IDE uses when it saves the file containing the Unicode character £ from your example. Most likely that is UTF-8, so put the declaration # -*- coding: utf-8 -*- on the first line of your send-tweet.py.
If the first line already contains a path directive (a shebang line such as #!/usr/bin/env python), put the encoding declaration on the second line instead.
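As a rough sketch of the header described above, assuming the file is saved as UTF-8 and starts with a shebang line (the exact interpreter path is a guess, not taken from the original script):

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# The coding declaration must sit on line 1 or line 2 of the file (PEP 263).

# Same test string as in the question.
message = u"Unicode test: £ ’ …"

# With the declaration in place the literal is decoded as UTF-8, so each
# symbol counts as one code point instead of a run of raw bytes.
print(len(message))
```

Without the shebang line, the coding comment simply moves up to line 1.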
Also, when writing Unicode characters in your Python source and declaring UTF-8 encoding, you must use an editor with UTF-8 file saving support, i.e. an editor that can serialize Unicode code points to UTF-8.
In this regard, please note that Unicode and UTF-8 are not the same. Unicode is the standard that assigns a code point to each character, while UTF-8 is one specific encoding that serializes those code points into bytes. UTF-8 is backward compatible with ASCII and uses 1 to 4 bytes per code point. The '\xc2' in your error message is the first byte of the UTF-8 encoding of £ (0xC2 0xA3).
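To make the difference between code points and bytes concrete, here is a small sketch that counts how many bytes a few characters need in UTF-8. The sample characters are picked for illustration only:

```python
# -*- coding: utf-8 -*-
# Illustrative samples: an ASCII letter, £, ’, and an emoji outside the BMP.
samples = [u"A", u"\u00a3", u"\u2019", u"\U0001F600"]

# Number of bytes each character occupies once encoded as UTF-8.
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(byte_lengths)  # [1, 2, 3, 4]
```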
So in the Python interpreter a string might be stored as Unicode, but if you want to write a Unicode string as UTF-8 you need to explicitly serialize it to UTF-8 first, e.g. by calling .encode('utf-8') on it.
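A minimal sketch of that explicit serialization, using the test string from the question (the variable names are only for illustration):

```python
# -*- coding: utf-8 -*-
text = u"Unicode test: £ ’ …"       # Unicode string (sequence of code points)
encoded = text.encode("utf-8")      # explicit serialization to UTF-8 bytes
decoded = encoded.decode("utf-8")   # reverse step: bytes back to Unicode

# repr() keeps the output ASCII-only, whatever encoding the terminal uses.
print(repr(encoded))
print(decoded == text)
```

The encoded form starts its first non-ASCII character with the byte 0xC2, the same byte the SyntaxError complains about.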
This is important especially when outputting Unicode strings to byte-oriented streams, e.g. when writing to a log file handle. Such a stream expects bytes, so content that contains non-ASCII characters has to be encoded (typically as UTF-8) before it is written.
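As a hedged illustration of writing to a byte-oriented stream, the sketch below uses an in-memory io.BytesIO as a stand-in for a log file handle. The helper name write_log_line is hypothetical and not part of the original script:

```python
import io


def write_log_line(stream, text):
    """Encode a Unicode string as UTF-8 and write it to a byte stream."""
    stream.write((text + u"\n").encode("utf-8"))


# BytesIO stands in for a real log file opened in binary mode.
log_buffer = io.BytesIO()
write_log_line(log_buffer, u"Unicode test: \u00a3 \u2019 \u2026")

# The stream now holds UTF-8 bytes, not Unicode code points.
print(repr(log_buffer.getvalue()))
```

An alternative with the same effect is to open the file through io.open(path, "w", encoding="utf-8") and let the file object do the encoding.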