I am using argparse to read in arguments for my Python code. One of those inputs is the title of a file (title), which can contain Unicode characters. I have been using 22少女時代22 as a test string.
I need to write the value of the input title to a file, but when I try to convert the string to UTF-8 it always throws an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8f in position 2: ordinal not in range(128)
I have been looking around and see I need my string to be in the form u"foo" to call .encode() on it.
When I run type() on my input from argparse I see:
<type 'str'>
I am looking to get a response of:
<type 'unicode'>
How can I get it in the right form?
Idea:
Modify argparse to take in a str but store it as a unicode string u"foo":
parser.add_argument(u'title', metavar='T', type=unicode, help='this will be unicode encoded.')
This approach is not working at all. Thoughts?
Edit 1:
Some sample code where title is 22少女時代22:
inputs = vars(parser.parse_args())
title = inputs["title"]
print type(title)
print type(u'foo')
title = title.encode('utf8') # This line throws the error
print title
It looks like your input data is in SJIS encoding (a legacy encoding for Japanese), which produces the byte 0x8f at position 2 in the bytestring:
(At Python 3 prompt)
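As a rough sketch of that check in Python 3 (the variable names are illustrative, and `shift_jis` is the standard-library codec name assumed here for SJIS), encoding the test string shows where the offending byte comes from:

```python
# Python 3: str is Unicode text, so we can encode it to see the raw bytes.
title = '22少女時代22'
raw = title.encode('shift_jis')

print(raw)          # the two ASCII digits, then two bytes per kanji, then two digits
print(hex(raw[2]))  # 0x8f, the first byte of the first kanji
```

The two leading `2` characters occupy positions 0 and 1, so the first non-ASCII byte sits at position 2, matching the error message.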
Now, I’m guessing that to “convert the string to UTF-8”, you used something like
title.encode('utf8')
The problem is that title is actually a bytestring containing the SJIS-encoded string. Due to a design flaw in Python 2, bytestrings can be directly encoded, and Python assumes the bytestring is ASCII-encoded. So what you have is conceptually equivalent to
title.decode('ascii').encode('utf8')
and of course the decode call fails.
You should instead explicitly decode from SJIS to a Unicode string before encoding to UTF-8:
title.decode('sjis').encode('utf8')
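The failure and the fix can be reproduced in a small sketch. It is written for Python 3 so that the implicit ASCII decode Python 2 performs has to be spelled out; `raw` is a stand-in for the Python 2 `str` that argparse returns, and the names are illustrative only.

```python
# Stand-in for the SJIS-encoded bytestring received from the command line.
raw = '22少女時代22'.encode('shift_jis')

# Python 2's str.encode() silently decodes with ASCII first; made explicit here.
try:
    raw.decode('ascii')
    implicit_decode_failed = False
except UnicodeDecodeError as exc:
    implicit_decode_failed = True
    print(exc)

# The fix: decode with the real source encoding, then encode to UTF-8.
text = raw.decode('shift_jis')
utf8_bytes = text.encode('utf-8')
print(text)
```

The design point is that decoding and encoding are separate steps, and the decode step has to name the encoding the bytes are actually in.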
As Mark Tolonen pointed out, you’re probably typing the characters into your console, and your console encoding is a non-Unicode encoding.
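One way to check which encoding the console reports is sketched below. The helper name `console_encoding` and the UTF-8 fallback are assumptions made for illustration, not part of any library.

```python
import locale
import sys


def console_encoding(default='utf-8'):
    """Best-effort guess at the encoding used for console input."""
    # sys.stdin may be missing or have no encoding when input is redirected.
    reported = getattr(sys.stdin, 'encoding', None)
    return reported or locale.getpreferredencoding(False) or default


print(console_encoding())
```

The printed value depends on the platform and terminal settings, so treat it as a hint about which codec to pass to decode rather than a guarantee.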
So it turns out your sys.stdin.encoding is cp932, which is Microsoft’s variant of SJIS. For this, use
title.decode('cp932').encode('utf8')
You really should set your console encoding to the standard UTF-8, but I’m not sure if that’s possible on Windows. If you do, you can skip the decoding/encoding step and just write your input bytestring to the file.
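Putting the pieces together, here is a hedged sketch of an argparse `type=` callable that decodes bytes arguments and then writes the title to a file as UTF-8. The helper `text_arg`, the output filename, and the explicit argument list are all hypothetical. The demo is meant to run under Python 3, where arguments already arrive as text; the bytes branch is the one that would matter under Python 2.

```python
import argparse
import io
import os
import sys
import tempfile


def text_arg(value, encoding=None):
    """argparse type= callable: return text, decoding bytes when needed."""
    if isinstance(value, bytes):  # a Python 2 str is a bytestring
        encoding = encoding or getattr(sys.stdin, 'encoding', None) or 'utf-8'
        return value.decode(encoding)
    return value  # Python 3 already provides Unicode text


parser = argparse.ArgumentParser()
parser.add_argument('title', metavar='T', type=text_arg,
                    help='title, stored as Unicode text')

# An explicit argument list keeps the demo independent of the real command line.
args = parser.parse_args(['22少女時代22'])

# io.open with an encoding accepts text and writes UTF-8 bytes to disk.
path = os.path.join(tempfile.mkdtemp(), 'title.txt')
with io.open(path, 'w', encoding='utf-8') as handle:
    handle.write(args.title)
```

Decoding once at the argument boundary means the rest of the program only handles text, and the encoding to UTF-8 happens in one place when the file is written.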