I have embedded a Python interpreter in a C program. Suppose the C program reads some bytes from a file into a char array and learns (somehow) that the bytes represent text with a certain encoding (e.g., ISO 8859-1, Windows-1252, or UTF-8). How do I decode the contents of this char array into a Python string?
The Python string should in general be of type unicode; for instance, a 0x93 in Windows-1252 encoded input becomes u'\u201c'.
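To make the expected result concrete, the same byte can be decoded from Python itself. This is only a sketch of the target value, not of the C API call; it assumes the codec names `windows_1252`, `iso8859_1`, and `utf_8` as spelled in Python's codec registry.

```python
# One byte, three candidate encodings.
raw = b"\x93"

# Windows-1252 maps 0x93 to the left double quotation mark, U+201C.
as_cp1252 = raw.decode("windows_1252")

# ISO 8859-1 maps every byte to the code point with the same number,
# so 0x93 becomes the C1 control character U+0093.
as_latin1 = raw.decode("iso8859_1")

# A lone 0x93 is not valid UTF-8 (it is a continuation byte), so with
# errors="replace" it turns into the replacement character U+FFFD.
as_utf8 = raw.decode("utf_8", "replace")

print(repr(as_cp1252), repr(as_latin1), repr(as_utf8))
```

The encoding therefore has to be supplied by the caller; the bytes alone do not determine the text.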
I have attempted to use PyString_Decode, but it always fails when there are non-ASCII characters in the string. Here is an example that fails:
    #include <Python.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        char c_string[] = { (char)0x93, 0 };
        PyObject *py_string;

        Py_Initialize();

        py_string = PyString_Decode(c_string, 1, "windows_1252", "replace");
        if (!py_string) {
            PyErr_Print();
            return 1;
        }
        return 0;
    }
The error message is UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128), which indicates that the ascii encoding is used even though we specify windows_1252 in the call to PyString_Decode.
The following code works around the problem by using PyString_FromString to create a Python string of the undecoded bytes, then calling its decode method:
    #include <Python.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        char c_string[] = { (char)0x93, 0 };
        PyObject *raw, *decoded;

        Py_Initialize();

        /* Wrap the undecoded bytes in a Python str. */
        raw = PyString_FromString(c_string);
        if (!raw) {
            PyErr_Print();
            return 1;
        }
        printf("Undecoded: ");
        PyObject_Print(raw, stdout, 0);
        printf("\n");

        /* Equivalent to raw.decode('windows_1252') in Python. */
        decoded = PyObject_CallMethod(raw, "decode", "s", "windows_1252");
        Py_DECREF(raw);
        if (!decoded) {
            PyErr_Print();
            return 1;
        }
        printf("Decoded: ");
        PyObject_Print(decoded, stdout, 0);
        printf("\n");
        return 0;
    }
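PyString_Decode does this: it builds a str object from the buffer with PyString_FromStringAndSize, passes that object to PyString_AsDecodedString, releases the temporary str, and returns the result.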
In other words, it does basically what you’re doing in your second example: it converts the bytes to a string, then decodes the string. The problem here arises from PyString_AsDecodedString, as opposed to PyString_AsDecodedObject. PyString_AsDecodedString calls PyString_AsDecodedObject, but then tries to convert the resulting unicode object back into a string object with the default encoding (for you, it looks like that’s ASCII). That’s where it fails.
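The difference between the two functions can be modelled in plain Python. This is a rough sketch, not the CPython source: the helper names `decode_only` and `decode_then_reencode` are made up for illustration, and it assumes the default encoding is ASCII and that the re-encoding step is strict regardless of the `errors` argument, which is what the error message in the question suggests.

```python
def decode_only(raw, encoding, errors="strict"):
    """Model of PyString_AsDecodedObject: bytes in, unicode out."""
    return raw.decode(encoding, errors)


def decode_then_reencode(raw, encoding, errors="strict",
                         default_encoding="ascii"):
    """Model of PyString_AsDecodedString: decode, then encode back to str."""
    text = raw.decode(encoding, errors)
    # Assumption: `errors` is not forwarded to this second step, so
    # passing "replace" to the decode does not rescue the encode.
    return text.encode(default_encoding)


raw = b"\x93"

# Decoding alone succeeds and yields the unicode value the asker wants.
decoded = decode_only(raw, "windows_1252", "replace")

# Re-encoding U+201C with the ASCII codec raises UnicodeEncodeError.
try:
    decode_then_reencode(raw, "windows_1252", "replace")
    reencode_failed = False
except UnicodeEncodeError:
    reencode_failed = True

print(repr(decoded), reencode_failed)
```

Pure-ASCII input survives both steps, which is why the failure only shows up once non-ASCII characters appear.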
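I believe you’ll need to do two calls, but you can use PyString_AsDecodedObject rather than calling the Python ‘decode’ method. Something like: create the str with PyString_FromStringAndSize(c_string, 1), check it for NULL, then pass it to PyString_AsDecodedObject(py_string, "windows_1252", "replace") and Py_DECREF the intermediate str. The return value is the unicode object.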
I’m not entirely sure what the reasoning behind PyString_Decode working this way is. A very old thread on python-dev seems to indicate that it has something to do with chaining the output, but since the Python methods don’t do the same, I’m not sure if that’s still relevant.