The following code raises "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)":
filename = 'Spywaj.ttf'
print repr(filename)
>> 'Sp\xc2\x88ywaj.ttf'
filepath = os.path.join('/dirname', filename)
But the file is valid and exists on disk. The filename was extracted from the output of the "unzip -l" command. How can I join filenames like this?
OS and filesystem
Filesystem: ext3 relatime,errors=remount-ro 0 0
Locale: en_US.UTF-8
With Alex's suggestion, os.path.join works now, but I still cannot access the file on disk using the joined filename.
filename = filename.decode('utf-8')
filepath = os.path.join('/dirname', filename)
print filepath
>> /dirname/u'Sp\xc2\x88ywaj.ttf'
print os.path.isfile(filepath)
>> False
new_filepath = filepath.encode('Latin-1').encode('utf-8')
print new_filepath
>> /dirname/u'Sp\xc2\x88ywaj.ttf'
print type(filepath)
>> <type 'unicode'>
print os.path.isfile(new_filepath)
>> False
valid_filepath = glob.glob('/dirname/*.ttf')[0]
print valid_filepath
>> /dirname/Spywaj.ttf (SO cannot display the chars in filename)
print type(valid_filepath)
>> <type 'str'>
print os.path.isfile(valid_filepath)
>> True
In both Latin-1 (ISO-8859-1) and Windows-1252, 0xc2 would be a capital A with a circumflex accent… which doesn't seem to be anywhere in the code you show! Can you please add a print repr(filename) before the os.path.join call (and also put the '/dirname' in a variable and print its repr, for completeness)? I'm thinking that maybe that stray character is there but you're not seeing it for some reason; the repr will reveal it.
If you do have a Latin-1 (or Win-1252) non-ASCII character in your filename, you have to use Unicode and/or, depending on your OS and filesystem, some specific encoding thereof.
Edit: the OP confirms, thanks to repr, that there are actually two bytes that can't possibly be ASCII: 0xc2 then 0x88, corresponding to what the OP thinks is one lowercase L.
Well, in the justly popular UTF-8 encoding that sequence would be the single Unicode codepoint U+0088, which is a C1 control character rather than a printable letter (0xc2 on its own is an uppercase A with circumflex only in Latin-1 or Windows-1252). How that could look like a lowercase L to the OP beggars explanation, but I imagine some fonts could be graphically crazy enough to afford such confusion.
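To check how those two bytes decode, here is a small sketch using the byte string reported by repr(filename) in the question. The variable names are mine, and the code is written so that it should run on Python 3 as well as on the Python 2 used in the question. The last part reproduces the ASCII decode failure, which is presumably what happens implicitly inside os.path.join when a byte string is mixed with a Unicode string on Python 2.

```python
import unicodedata

# Bytes reported by repr(filename) in the question.
raw = b'Sp\xc2\x88ywaj.ttf'

# Interpreted as UTF-8, the pair 0xc2 0x88 is ONE codepoint, U+0088.
as_utf8 = raw.decode('utf-8')
# Interpreted as Latin-1, every byte becomes its own character.
as_latin1 = raw.decode('latin-1')

utf8_char = as_utf8[2]      # U+0088
latin1_char = as_latin1[2]  # U+00C2

print(len(as_utf8))                     # 11 characters
print(len(as_latin1))                   # 12 characters
print(unicodedata.category(utf8_char))  # Cc (a control character)
print(unicodedata.name(latin1_char))    # LATIN CAPITAL LETTER A WITH CIRCUMFLEX

# Decoding the same bytes as ASCII fails at index 2, matching the
# "byte 0xc2 in position 2" part of the error in the question.
try:
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    ascii_error = exc
    print(exc)
```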
So I would first try filename = filename.decode('utf-8'), which should allow the os.path.join to work. If open then balks at the resulting Unicode string (it might work, depending on the filesystem and OS), the next attempt is to try using that Unicode object's .encode('Latin-1') and .encode('utf-8'). If none of the encodings work, information on the OS and filesystem in use, which the OP, I believe, hasn't given yet, becomes crucial.
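The attempts described above can be sketched as a loop over candidate byte-string paths. This is only an illustration: candidate_paths is a hypothetical helper, not part of any library, and it assumes the raw name from "unzip -l" is a byte string and that the directory name is plain ASCII or UTF-8. It keeps every path as bytes, so nothing is mixed with Unicode at join time, and it also tries the raw bytes unchanged, since glob returned a plain str in the question.

```python
import os


def candidate_paths(dirname, raw_name, encodings=('utf-8', 'latin-1')):
    """Return byte-string paths worth trying, in order (hypothetical helper)."""
    # Make sure the directory is bytes so os.path.join never mixes types.
    if not isinstance(dirname, bytes):
        dirname = dirname.encode('utf-8')

    # First candidate: the raw bytes exactly as extracted.
    candidates = [os.path.join(dirname, raw_name)]

    try:
        text = raw_name.decode('utf-8')
    except UnicodeDecodeError:
        return candidates  # not valid UTF-8, nothing more to derive

    for encoding in encodings:
        try:
            path = os.path.join(dirname, text.encode(encoding))
        except UnicodeEncodeError:
            continue  # this encoding cannot represent the name
        if path not in candidates:
            candidates.append(path)
    return candidates


# Report which candidate, if any, exists on disk.
for path in candidate_paths('/dirname', b'Sp\xc2\x88ywaj.ttf'):
    print(repr(path))
    print(os.path.isfile(path))
```

If none of the candidates exists, comparing them against repr(os.listdir(dirname)) shows the bytes the filesystem actually stores.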