This is a section from Dive Into Python 3 regarding strings:
In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in utf-8, or a Python string encoded as CP-1252. “Is this string utf-8?” is an invalid question. utf-8 is a way of encoding characters as a sequence of bytes. If you want to take a string and turn it into a sequence of bytes in a particular character encoding, Python 3 can help you with that. If you want to take a sequence of bytes and turn it into a string, Python 3 can help you with that too. Bytes are not characters; bytes are bytes. Characters are an abstraction. A string is a sequence of those abstractions.
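The distinction the passage draws can be sketched in a few lines. This is only an illustration: the sample word and the variable names are my own choices, not from the book.

```python
# A str is a sequence of Unicode characters; it has no encoding of its own.
text = "caf\u00e9"  # "café", with a precomposed é (illustrative sample)

# Encoding turns characters into bytes under a specific character encoding.
utf8_bytes = text.encode("utf-8")
cp1252_bytes = text.encode("cp1252")

print(type(text).__name__, len(text))              # str, 4 characters
print(type(utf8_bytes).__name__, len(utf8_bytes))  # bytes, 5 bytes (é takes two)
print(len(cp1252_bytes))                           # 4 bytes (é takes one)

# Decoding turns bytes back into characters, given the right encoding.
round_trip = utf8_bytes.decode("utf-8")
print(round_trip == text)
```

The same four characters produce different byte sequences under different encodings, which is why asking whether the string itself "is utf-8" has no answer.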
Earlier today I used the hashlib module and read the help text for md5 that says:
Return a new MD5 hash object; optionally initialized with a string.
Well, it doesn’t accept a string – it accepts a bytes object.
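A minimal sketch of the behaviour described here, assuming Python 3 and a build where MD5 is available (the sample message and variable names are illustrative):

```python
import hashlib

message = "hello"  # a str, i.e. a sequence of characters

# Passing a str is rejected in Python 3.
try:
    hashlib.md5(message)
    rejected = False
except TypeError:
    rejected = True

# Encoding to bytes first works; the encoding chosen affects the digest
# for any text that contains non-ASCII characters.
digest = hashlib.md5(message.encode("utf-8")).hexdigest()

print(rejected)
print(digest)
```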
Maybe I’m reading too much into this, but wouldn’t it make more sense if the help text said that a bytes object should be used instead? Or are people using the same name for strings and bytes?
In Python 2, `str` was used both for strings of characters and for bytes. In fact, until Python 2.6 there wasn’t even a `bytes` type (and in 2.6 and 2.7, `bytes is str`). The inconsistencies you mention in the hashlib documentation are an artifact of this history.