I have a document in Spanish I’d like to format using Python. The problem is that in the output file the accented characters are garbled, like this: \xc3\xad.
I succeeded in keeping the proper characters when I did some similar editing a while back, and although I’ve tried everything I did then and more, somehow it won’t work this time.
This is the current version of the code:
# -*- coding: utf-8 -*-
import re
import pickle
inputfile = open("input.txt").read()
pat = re.compile(r"(@.*\*)")
mylist = pat.findall(inputfile)
outputfile = open("output.txt", "w")
pickle.dump(mylist, outputfile)
outputfile.close()
I’m using Python 2.7 on Windows 7.
Can anyone see any obvious problems? The inputfile is encoded in utf-8, but I’ve tried encoding it as latin-1 too. Thanks.
To clarify: My problem is that the Latin characters don’t show up properly in the output.
It’s solved now, I just had to add this line as suggested by mata:
inputfile = inputfile.decode('utf-8')
If the input file is encoded in utf-8, then you should decode it first (for example with inputfile.decode('utf-8')) before working with it.
The file created this way will contain a pickled version of your list. If you would rather have a human-readable file, then you might want to just use a plain file.
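As a rough sketch of the "decode first, then write a plain file" idea: the helper name extract_to_text_file and its parameters are made up for illustration, and it assumes the input really is utf-8. It reads bytes and decodes them explicitly, so it should behave the same on Python 2.7 and Python 3.

```python
import io
import re

# Same pattern as in the question.
PATTERN = re.compile(r"(@.*\*)")


def extract_to_text_file(in_path, out_path, encoding="utf-8"):
    """Hypothetical helper: decode the input, collect matches, write them as text."""
    # Read raw bytes, then decode them explicitly into unicode text.
    with io.open(in_path, "rb") as src:
        text = src.read().decode(encoding)

    matches = PATTERN.findall(text)

    # io.open in text mode encodes on write, so the output file holds
    # the accented characters themselves, one match per line.
    with io.open(out_path, "w", encoding=encoding) as dst:
        for match in matches:
            dst.write(match + u"\n")

    return matches
```

Calling extract_to_text_file("input.txt", "output.txt") should then give an output file you can open in any editor that understands utf-8, with no pickle involved.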
Also, a good way to deal with different encodings is to use the codecs module.
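Something along these lines would be one way to use codecs here. It is only a sketch: the function name extract_with_codecs is invented for the example, and the utf-8 default is an assumption about your files. codecs.open decodes while reading and encodes while writing, so no separate decode step is needed.

```python
import codecs
import re


def extract_with_codecs(in_path, out_path, encoding="utf-8"):
    """Hypothetical helper: same extraction, with codecs handling the encoding."""
    # codecs.open decodes on read, so read() returns unicode text.
    with codecs.open(in_path, "r", encoding=encoding) as src:
        text = src.read()

    # Same pattern as in the question.
    matches = re.findall(r"(@.*\*)", text)

    # On write, codecs.open encodes the unicode text back to bytes.
    with codecs.open(out_path, "w", encoding=encoding) as dst:
        for match in matches:
            dst.write(match + u"\n")

    return matches
```

If the input turns out to be latin-1 instead, passing encoding="latin-1" is the only change needed. On Python 3 the built-in open accepts an encoding argument directly, so codecs is mostly useful on Python 2.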