I need to test whether a string is Unicode, and then whether it's UTF-8. After that, I need the string's length in bytes, including the BOM if it has one. How can this be done in Python?
Also for didactic purposes, what does a byte list representation of a UTF-8 string look like? I am curious how a UTF-8 string is represented in Python.
Later edit: pprint does that pretty well.
In Python 2, str is a sequence of bytes and unicode is a sequence of characters. You use str.decode to decode a byte sequence to unicode, and unicode.encode to encode a sequence of characters to str. So for example, u"é" is the unicode string containing the single character U+00E9 and can also be written u"\xe9"; encoding it into UTF-8 gives the byte sequence "\xc3\xa9".

In Python 3, this is changed; bytes is a sequence of bytes and str is a sequence of characters.
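
As a rough illustration of the points above in Python 3, the sketch below encodes a str to UTF-8, shows the bytes as a list of integers, checks whether a bytes value decodes as UTF-8, and counts the encoded length with and without a BOM. The helper names is_valid_utf8 and utf8_byte_length are made up for this example and are not part of the standard library; codecs.BOM_UTF8 and the "utf-8-sig" codec are.

```python
import codecs

text = "\u00e9"              # the single character U+00E9 (é), a str
utf8 = text.encode("utf-8")  # bytes: b'\xc3\xa9'

# A bytes object iterates as integers, which gives the "byte list" view.
byte_list = list(utf8)       # [195, 169]


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def utf8_byte_length(s: str, with_bom: bool = False) -> int:
    """Hypothetical helper: encoded size of s in bytes.

    A str has no BOM of its own; the BOM only exists in the encoded
    bytes, so it is counted here only when explicitly requested.
    """
    size = len(s.encode("utf-8"))
    if with_bom:
        size += len(codecs.BOM_UTF8)  # the 3 bytes EF BB BF
    return size


if __name__ == "__main__":
    print(byte_list)
    print(is_valid_utf8(utf8), is_valid_utf8(b"\xe9"))
    print(utf8_byte_length(text), utf8_byte_length(text, with_bom=True))
```

Note that a successful decode only shows the bytes are well-formed UTF-8, not that UTF-8 was the encoding the author intended. Encoding with "utf-8-sig" instead of "utf-8" writes the BOM for you, which is another way to get the BOM-inclusive length.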