I can’t get a grip on how Python handles Unicode in files…
f = open('test.txt', 'w')
f.write('abc')
f.close()
That gives a file of 3 bytes.
f = open('test.txt', 'w')
f.write('abcé')
f.close()
That gives a file of 5 bytes (the é takes up two bytes, but how does Python know that it must read 2 bytes there?)
f = open('test.txt', 'w')
f.write('abcそ') # a Japanese character
f.close()
That gives a file of 6 bytes (the そ takes up three bytes, but how does Python know that it must read 3 bytes there?)
So I could understand Unicode taking two bytes per character, but here it is sometimes 1, 2 or 3 bytes, and I fail to see how that works.
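In your case, the output file is written with an encoding of UTF-8. (When no encoding argument is given, open() in text mode uses a platform-dependent default, which is UTF-8 on most Linux and macOS systems but not necessarily on Windows, so passing encoding='utf-8' explicitly is safer.) UTF-8 is a variable-length encoding: it encodes ASCII characters (code points U+0000-U+007F) using 1 byte, code points U+0080-U+07FF (which includes Latin-1 characters such as é) using 2 bytes, code points U+0800-U+FFFF (which includes most Chinese and Japanese characters such as そ) using 3 bytes, and code points U+10000-U+10FFFF using 4 bytes.
As for how Python knows how many bytes to read: the high bits of the first byte of each character give the length of the sequence. A byte of the form 0xxxxxxx is a complete 1-byte character, 110xxxxx starts a 2-byte sequence, 1110xxxx a 3-byte sequence, and 11110xxx a 4-byte sequence, while every continuation byte has the form 10xxxxxx.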
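To make the length rule concrete, here is a minimal sketch that splits UTF-8 bytes into characters by looking only at the first byte of each sequence. The helper names utf8_sequence_length and split_utf8 are made up for this illustration and are not part of Python; in real code you would simply call bytes.decode('utf-8'), which also validates the continuation bytes.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return the length of a UTF-8 sequence, judged from its first byte."""
    if lead_byte < 0x80:            # 0xxxxxxx: ASCII, 1 byte
        return 1
    if lead_byte >> 5 == 0b110:     # 110xxxxx: starts a 2-byte sequence
        return 2
    if lead_byte >> 4 == 0b1110:    # 1110xxxx: starts a 3-byte sequence
        return 3
    if lead_byte >> 3 == 0b11110:   # 11110xxx: starts a 4-byte sequence
        return 4
    # 10xxxxxx (a continuation byte) or an invalid value
    raise ValueError(f"{lead_byte:#04x} is not a valid UTF-8 lead byte")


def split_utf8(data: bytes) -> list:
    """Split UTF-8 data into one bytes object per character (hypothetical helper)."""
    chunks = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


# 'abcéそ', written with escapes so the example does not depend on the source file's encoding
encoded = "abc\u00e9\u305d".encode("utf-8")
pieces = split_utf8(encoded)
print([len(p) for p in pieces])             # [1, 1, 1, 2, 3]
print([p.decode("utf-8") for p in pieces])  # ['a', 'b', 'c', 'é', 'そ']
```

The three ASCII letters each occupy one byte, é occupies two and そ occupies three, which matches the 5-byte and 6-byte file sizes in the question.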
If you want to use a different encoding, such as UTF-16, you can use str.encode with your desired encoding:
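As a rough sketch of what that could look like (the file names and the temporary directory are placeholders chosen for this example): str.encode returns bytes, so the file has to be opened in binary mode, or you can pass encoding= to open() and let it do the encoding for you. The example uses 'utf-16-le' so the sizes are predictable; plain 'utf-16' also writes a 2-byte byte order mark at the start.

```python
import os
import tempfile

text = "abc\u00e9\u305d"  # 'abcéそ'

# Write into a temporary directory so the sketch does not overwrite a real test.txt.
tmp_dir = tempfile.mkdtemp()
path_a = os.path.join(tmp_dir, "encoded_by_hand.txt")
path_b = os.path.join(tmp_dir, "encoded_by_open.txt")

# Option 1: encode with str.encode and write the raw bytes in binary mode.
with open(path_a, "wb") as f:
    f.write(text.encode("utf-16-le"))

# Option 2: open in text mode and let open() encode on the way out.
with open(path_b, "w", encoding="utf-16-le") as f:
    f.write(text)

size_a = os.path.getsize(path_a)
size_b = os.path.getsize(path_b)
print(size_a, size_b)  # 10 10 (5 characters, 2 bytes each)

# Reading back requires naming the same encoding.
with open(path_b, "r", encoding="utf-16-le") as f:
    round_trip = f.read()
```

All five characters here are in the range U+0000-U+FFFF, so UTF-16 uses exactly two bytes for each; characters above U+FFFF would take four bytes.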