I’ve been trying to mass-convert a bunch of text files to UTF-8 in Python, and this error keeps popping up. Is there a way to replace them using a Python script or bash commands?
I used this code:
writer = codecs.open(os.path.join(wrd, 'dict.en'), 'wtr', 'utf-8')
for infile in glob.glob(os.path.join(wrd,'*.txt')):
    print infile
    for line in open(infile):
        writer.write(line.encode('utf-8'))
and got these sorts of errors:
Traceback (most recent call last):
  File "dicting.py", line 30, in <module>
    writer.write(line2.encode('utf-8'))
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 216: unexpected code byte
OK, first point: your output file is set to automatically encode text written to it as utf-8, so don’t include an explicit encode('utf-8') method call when passing arguments to the write() method.
So the first thing to try is to simply use the following in your inner loop: writer.write(line)
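To illustrate the point about not encoding twice, here is a minimal sketch, assuming Python 3 (the code in the question is Python 2). The helper name write_lines_utf8 and the sample strings are made up for illustration; io.open is used because it accepts an encoding argument and does the encoding on write.

```python
import io
import os
import tempfile


def write_lines_utf8(out_path, lines):
    """Write text lines to out_path; the file object does the UTF-8 encoding."""
    with io.open(out_path, 'w', encoding='utf-8') as writer:
        for line in lines:
            # Pass text straight through: no .encode('utf-8') here,
            # because the writer already encodes whatever it is given.
            writer.write(line)


# Hypothetical demo data written to a throwaway directory.
tmp_dir = tempfile.mkdtemp()
out_path = os.path.join(tmp_dir, 'dict.en')
write_lines_utf8(out_path, [u'caf\u00e9\n', u'na\u00efve\n'])

# Read the result back as raw bytes to see what ended up on disk.
with open(out_path, 'rb') as f:
    raw = f.read()
```

The bytes in raw are already UTF-8 at this point, so an extra encode step before write() would only get in the way.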
If that doesn’t work, then the problem is almost certainly the fact that, as others have noted, you aren’t decoding your input file properly.
Taking a wild guess and assuming that your input files are encoded in cp1252, you could try as a quick test the following in the inner loop: writer.write(line.decode('cp1252'))
Minor point: 'wtr' is a nonsensical mode string ('w' already selects writing, and the trailing 'r' does not add read access). Simplify it to either 'wt' or even just 'w'.
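The cp1252 guess can be sketched end to end as follows, again assuming Python 3. merge_to_utf8 and its parameters are hypothetical names, and cp1252 is only a guess at the source encoding, so substitute whatever your files actually use.

```python
import glob
import io
import os


def merge_to_utf8(src_dir, out_name='dict.en', src_encoding='cp1252'):
    """Decode every *.txt file in src_dir and append it to one UTF-8 file."""
    out_path = os.path.join(src_dir, out_name)
    with io.open(out_path, 'w', encoding='utf-8') as writer:
        for infile in sorted(glob.glob(os.path.join(src_dir, '*.txt'))):
            # Decode on read with the assumed source encoding ...
            with io.open(infile, 'r', encoding=src_encoding) as reader:
                for line in reader:
                    # ... and let the writer encode to UTF-8 on write.
                    writer.write(line)
    return out_path
```

Under this assumption the byte 0xa0 from the traceback is read as a non-breaking space (U+00A0) and written out as the two bytes 0xc2 0xa0. cp1252 leaves a few byte values undefined (0x81, for example), and those would still raise UnicodeDecodeError; passing errors='replace' to the reading io.open call is one way to substitute them instead of failing.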