I’m trying to write a Python C extension that processes byte strings, and I have something basically working for Python 2.x and Python 3.x.
For the Python 2.x code, near the start of my function, I currently have a line:
    if (!PyArg_ParseTuple(args, "s#:in_bytes", &src_ptr, &src_len))
        ...
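For context, the surrounding function looks roughly like this (a minimal sketch; process_bytes and the module boilerplate are invented for illustration):

    #include <Python.h>  /* Python 2.x */

    static PyObject *
    process_bytes(PyObject *self, PyObject *args)
    {
        const char *src_ptr;
        int src_len;  /* s# writes an int here unless PY_SSIZE_T_CLEAN is defined */

        /* In 2.x, s# accepts both str and unicode; unicode arguments are
           converted using the default encoding, which is what I'd like to avoid. */
        if (!PyArg_ParseTuple(args, "s#:in_bytes", &src_ptr, &src_len))
            return NULL;

        /* ... process src_len bytes starting at src_ptr ... */

        Py_RETURN_NONE;
    }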
I notice that the s# format specifier accepts both Unicode strings and byte strings. I really just want it to accept byte strings and reject Unicode. For Python 2.x, this might be “good enough”: the standard hashlib seems to do the same, accepting Unicode as well as byte strings. However, Python 3.x is meant to clean up the Unicode/byte-string mess and not let the two be interchangeable.
So, I’m surprised to find that in Python 3.x, the s format specifiers for PyArg_ParseTuple() still seem to accept Unicode, providing a “default encoded string version” of it. This seems to go against the principles of Python 3.x, and it makes the s format specifiers unusable in practice for anyone who needs to accept bytes and reject text. Is my analysis correct, or am I missing something?
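To make that concrete, here’s a minimal 3.x sketch (the function name and the choice to return the length are mine): call it with a str and it succeeds, handing the C code the UTF-8 encoding of the text rather than raising TypeError.

    #define PY_SSIZE_T_CLEAN
    #include <Python.h>  /* Python 3.x */

    static PyObject *
    in_bytes(PyObject *self, PyObject *args)
    {
        const char *src_ptr;
        Py_ssize_t src_len;

        /* In 3.x, s# converts a str argument to its UTF-8 encoding
           instead of rejecting it -- so text slips through as bytes. */
        if (!PyArg_ParseTuple(args, "s#", &src_ptr, &src_len))
            return NULL;
        return PyLong_FromSsize_t(src_len);
    }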
Looking at the implementation of hashlib for Python 3.x (e.g. see md5module.c, function MD5_update(), and its use of the GET_BUFFER_VIEW_OR_ERROUT() macro), I see that it avoids the s format specifiers entirely: it takes a generic object (O specifier) and then does explicit buffer checks via the GET_BUFFER_VIEW_OR_ERROUT() macro. Is this what we have to do?
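For reference, what that macro boils down to is roughly the following (my own paraphrase using the public PyObject_GetBuffer() API, not the actual hashlib code; update() is a made-up name):

    #define PY_SSIZE_T_CLEAN
    #include <Python.h>

    static PyObject *
    update(PyObject *self, PyObject *args)
    {
        PyObject *obj;
        Py_buffer view;

        if (!PyArg_ParseTuple(args, "O:update", &obj))
            return NULL;

        /* Ask for a simple contiguous buffer; str has no buffer interface
           in 3.x, so Unicode arguments fail here with a TypeError. */
        if (PyObject_GetBuffer(obj, &view, PyBUF_SIMPLE) < 0)
            return NULL;

        /* ... hash/process view.len bytes starting at view.buf ... */

        PyBuffer_Release(&view);
        Py_RETURN_NONE;
    }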
I agree with you: it’s one of several spots where the C API migration of Python 3 was clearly not designed as carefully and thoroughly as the parts visible to Python coders. I also agree that probably the best workaround for now is focusing on “buffer views”, per that macro, until and unless something better gets designed into a future Python C API (don’t hold your breath waiting for that to happen, though ;-).
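For what it’s worth, one concrete way to get that buffer-view behavior straight from PyArg_ParseTuple(): newer 3.x releases document a y* format unit that fills a Py_buffer and rejects str outright (check that your target version has it; the function name below is made up). A minimal sketch:

    #define PY_SSIZE_T_CLEAN
    #include <Python.h>

    static PyObject *
    in_bytes(PyObject *self, PyObject *args)
    {
        Py_buffer view;

        /* y* fills a Py_buffer from any bytes-like object and raises
           TypeError for str -- no default-encoding surprises. */
        if (!PyArg_ParseTuple(args, "y*:in_bytes", &view))
            return NULL;

        /* ... process view.len bytes starting at view.buf ... */

        PyBuffer_Release(&view);
        Py_RETURN_NONE;
    }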