I’m having some brain failure in understanding reading and writing text to a file (Python 2.4).
# The string, which has an a-acute in it.
ss = u'Capit\xe1n'
ss8 = ss.encode('utf8')
repr(ss), repr(ss8)
("u'Capit\xe1n'", "'Capit\xc3\xa1n'")
print ss, ss8
print >> open('f1','w'), ss8
>>> file('f1').read()
'Capit\xc3\xa1n\n'
So I type Capit\xc3\xa1n into my favorite editor and save it as file f2.
Then:
>>> open('f1').read()
'Capit\xc3\xa1n\n'
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
>>> open('f1').read().decode('utf8')
u'Capit\xe1n\n'
>>> open('f2').read().decode('utf8')
u'Capit\\xc3\\xa1n\n'
What am I not understanding here? Clearly there is some vital bit of magic (or good sense) that I’m missing. What does one type into text files to get proper conversions?
What I’m truly failing to grok here, is what the point of the UTF-8 representation is, if you can’t actually get Python to recognize it, when it comes from outside. Maybe I should just JSON dump the string, and use that instead, since that has an asciiable representation! More to the point, is there an ASCII representation of this Unicode object that Python will recognize and decode, when coming in from a file? If so, how do I get it?
>>> print simplejson.dumps(ss)
"Capit\u00e1n"
>>> print >> file('f3','w'), simplejson.dumps(ss)
>>> simplejson.load(open('f3'))
u'Capit\xe1n'
In the notation u'Capit\xe1n\n' (which should be just 'Capit\xe1n\n' in 3.x, and must be in 3.0 through 3.2), the \xe1 represents just one character. \x is an escape sequence, indicating that e1 is in hexadecimal.

Writing Capit\xc3\xa1n into the file in a text editor means that the file actually contains the literal text \xc3\xa1. Those are 8 bytes, and the code reads them all. We can see this by displaying the result: open('f2').read() gives 'Capit\\xc3\\xa1n\n', where each doubled backslash in the repr stands for one literal backslash character.

Instead, just input characters like á in the editor, which should then handle the conversion to UTF-8 and save it.

In 2.x, a string that actually contains these backslash-escape sequences can be decoded using the string_escape codec, for example 'Capit\\xc3\\xa1n'.decode('string_escape').

The result is a str that is encoded in UTF-8, where the accented character is represented by the two bytes that were written \\xc3\\xa1 in the original string. To get a unicode result, decode again with UTF-8.

In 3.x, the string_escape codec is replaced with unicode_escape, and it is strictly enforced that we can only encode from a str to bytes, and decode from bytes to str. unicode_escape needs to start with a bytes in order to process the escape sequences (the other way around, it adds them); and then it will treat the resulting \xc3 and \xa1 as character escapes rather than byte escapes. As a result, we have to do a bit more work.
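A minimal sketch of that extra work in 3.x, assuming the file content has been read as bytes (for example with open('f2', 'rb')). The helper name decode_escaped_utf8 is hypothetical, not part of any library. It decodes the escapes with unicode_escape, converts the resulting characters back to raw bytes through Latin-1 (which maps code points 0 to 255 onto the same byte values), and then decodes those bytes as UTF-8.

```python
def decode_escaped_utf8(escaped: bytes) -> str:
    """Hypothetical helper: turn literal \\xNN escapes of UTF-8 bytes into text."""
    # Step 1: process the backslash escapes. Each \xNN becomes ONE character
    # with code point 0xNN, so \xc3\xa1 turns into the two characters U+00C3, U+00A1.
    as_chars = escaped.decode('unicode_escape')
    # Step 2: Latin-1 maps code points 0-255 to identical byte values,
    # which recovers the bytes the escapes were meant to describe.
    raw = as_chars.encode('latin-1')
    # Step 3: those bytes are the real UTF-8 encoding of the text.
    return raw.decode('utf-8')


# What the editor saved in f2: a literal backslash, 'x', 'c', '3', and so on.
literal = b'Capit\\xc3\\xa1n'
result = decode_escaped_utf8(literal)
print(ascii(result))  # 'Capit\xe1n'
```

If the text was read in text mode and is already a str, it would first need to be encoded to bytes (for example with .encode('ascii'), assuming it holds only ASCII characters) before calling the helper. This only suits input that is pure ASCII apart from the escapes, since unicode_escape interprets any non-ASCII bytes as Latin-1.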