I’ve set up a script that basically does a large-scale find-and-replace on a plain text document.
At the moment it works fine with documents encoded in ASCII, UTF-8, and UTF-16 (and possibly others, but I’ve only tested those three), as long as the encoding is specified inside the script (the example code below specifies UTF-16).
Is there a way to make the script automatically detect the input file’s character encoding and write the output file in that same encoding?
findreplace = [
    ('term1', 'term2'),
]

# Read the raw bytes and decode them with the hard-coded encoding
inF = open(infile, 'rb')
s = unicode(inF.read(), 'utf-16')
inF.close()

# Apply each replacement pair in turn
for couple in findreplace:
    s = s.replace(couple[0], couple[1])

# Write the result back out in the same hard-coded encoding
outF = open(outFile, 'wb')
outF.write(s.encode('utf-16'))
outF.close()
Thanks!
From the link J.F. Sebastian posted: try chardet.
Keep in mind that, in general, it’s impossible to detect the character encoding of every input file with 100% reliability: some byte sequences can be interpreted equally well as any of several encodings, and there may be no way to tell which one was actually used. chardet uses heuristics and reports a confidence level indicating how “sure” it is that the encoding it detected is correct.
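For example, here is a minimal sketch of how you might plug chardet into your script. It assumes infile and outFile are defined as in your code, and it uses chardet.detect(), which takes raw bytes and returns a dict containing an 'encoding' guess and a 'confidence' score:

import chardet

inF = open(infile, 'rb')
raw = inF.read()                 # read the raw bytes without decoding
inF.close()

guess = chardet.detect(raw)      # e.g. {'encoding': 'UTF-16', 'confidence': 1.0}
encoding = guess['encoding']

s = unicode(raw, encoding)       # decode with the detected encoding
for couple in findreplace:
    s = s.replace(couple[0], couple[1])

outF = open(outFile, 'wb')
outF.write(s.encode(encoding))   # write the output in the same encoding
outF.close()

Because the detection is only a heuristic, you may want to check guess['confidence'] and fall back to a default encoding (or raise an error) when it is low.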