I have a program that takes an argument from the shell. This argument is the query used in a search operation.
If I pass in English words (i.e. no accents, etc.), it works fine. However, if I pass in, say, ‘café’, I get ‘cafÚ’ (print sys.argv[1] prints cafÚ instead of café).
I thought I could solve the problem by converting it into a Unicode object, but I was wrong.
Q = unicode(sys.argv[1], encoding=sys.stdin.encoding)
I still get ‘cafÚ’!! I’m going crazy…
I bet you’re on Windows, right?
Use encoding="cp1252" instead, then it should work.
Explanation (with some guesswork):
cmd windows use cp850 as their default codepage. This was evident from my interactive session: 0x82 is é in cp850.
Windows programs outside the console apparently use cp1252 as their standard encoding: é is 0xe9 in cp1252 (like in Unicode). Since 0xe9 is Ú in cp850, a cp1252 string that is printed or decoded as cp850 shows cafÚ.
The same mismatch shows up when writing to a file that is then read as cp1252, with a holding the bytes caf\x82 from the console:
If I do f.write(a), I get caf‚ as the contents of my file (because ‚, a low quotation mark that looks like a comma, is 0x82 in cp1252).
If I do f.write(a.decode("cp850").encode("cp1252")), I get café.
Moral: Find out the correct encodings in your environment, convert everything to Unicode as soon as possible, work with it, then convert back to the encoding you need. If you’re outputting into an interactive window, use cp850; if you’re outputting into a file, use cp1252.
Or switch to Python 3, which makes all of this much easier.
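To make the byte values above concrete, here is a minimal sketch in Python 3 that replays the mismatch with plain bytes objects, so it runs on any platform. The names console_bytes, gui_bytes and transcode are made up for illustration; only the byte values and the two code pages come from the explanation.

```python
def transcode(data, source, target):
    """Decode bytes from the `source` code page, re-encode them as `target`."""
    return data.decode(source).encode(target)

# Hypothetical stand-ins for the two byte strings discussed above.
console_bytes = b"caf\x82"  # 'café' as a cp850 console delivers it
gui_bytes = b"caf\xe9"      # 'café' as cp1252 encodes it

# Decoding with the matching code page recovers the text.
right = gui_bytes.decode("cp1252")        # 'café'

# Decoding cp1252 bytes as cp850 reproduces the symptom from the question.
wrong = gui_bytes.decode("cp850")         # 'cafÚ'

# Reading cp850 bytes as cp1252 gives the comma-like low quotation mark.
comma = console_bytes.decode("cp1252")    # 'caf‚'

# Decode with the source code page, encode with the target one.
fixed = transcode(console_bytes, "cp850", "cp1252")  # b'caf\xe9'
```

In Python 3 the entries of sys.argv are already str objects, so the decode step for command-line arguments is not needed there at all.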