I have a French dictionary file which I got from WinEdt.org (Zip File). I’d like to read this file into memory, but when I do I get the error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 69: ordinal not in range(128)
I’ve also tried using the codecs module with the encoding utf-8, but that doesn’t work either:
with codecs.open(self.template_folder_path + "/" + self.test_language + ".txt",
                 'rb', encoding='utf-8') as fp:
    word_list = []
    for line in fp:
        word_list.append(line.strip())
    self.words[self.test_language] = word_list
How can I read this file? I also need to read in a few other dictionary files from that website. How do I go about that?
latin1 (aka ISO-8859-1) is “a snare and a delusion”. Decoding random binary gibberish with latin1 “works” because the latin1 codec maps all 256 byte values to Unicode code points, so it never raises an error even when it is the wrong encoding.
In this case, given the information (1) French and (2) “WinEdt.org” (hello hello, that’s “Win” as in “Windows”), the file is likely to be encoded in cp1252.
Update: You asked about other files on that website. The first thing to do would be (as the site recommends) to read the .TXT file associated with the dictionary. For example, the large Russian dictionary’s .TXT file says “The dictionary assumes standard Windows Russian codepage (1251)”. Failing that, try the most appropriate from this list:
cp1250 eastern European Latin-based scripts e.g. Polish, Czech, Serbian (Latin script)
cp1251 Cyrillic-based scripts e.g. Russian, Ukrainian, Serbian (Cyrillic script)
cp1252 western European Latin-based scripts e.g. German, French
cp1253 Greek
cp1254 Turkish
cp1255 Hebrew
cp1256 Arabic
cp1257 Estonian, Latvian and Lithuanian
cp1258 Vietnamese
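As a rough sketch of what that looks like in practice (assuming Python 3, or Python 2.6+ via io.open): the helper name read_word_list, the temporary file, and the two sample words are made up for illustration and are not taken from the WinEdt dictionaries. It reads a file with an explicit codepage, then shows why latin1 is a trap: it decodes the same bytes without complaint but turns the cp1252 byte for “œ” into a control character.

```python
import io
import os
import tempfile


def read_word_list(path, encoding="cp1252"):
    """Return the stripped lines of a word list stored in a Windows codepage.

    `encoding` is a guess based on the language; check the dictionary's
    accompanying .TXT file first, as suggested above.
    """
    with io.open(path, "r", encoding=encoding) as fp:
        return [line.strip() for line in fp]


# Hypothetical sample data standing in for a real dictionary file.
sample = u"c\u0153ur\nch\u00e2teau\n".encode("cp1252")
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write(sample)

words = read_word_list(demo_path)
os.remove(demo_path)

# latin1 never raises, but byte 0x9c becomes the control character U+009C
# instead of the ligature U+0153 that cp1252 assigns to it.
latin1_guess = sample.decode("latin1")
```

For the Russian dictionary you would pass encoding="cp1251" instead. Note that a decode finishing without an error does not prove the codepage is right, so it is worth eyeballing a few accented words afterwards.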