I need the fastest way to convert files from latin1 to utf-8 in Python. The files are large, about 2 GB each (I am moving DB data). So far I have:
import codecs
infile = codecs.open(tmpfile, 'r', encoding='latin1')
outfile = codecs.open(tmpfile1, 'w', encoding='utf-8')
for line in infile:
    outfile.write(line)
infile.close()
outfile.close()
but it is still slow. The conversion takes one fourth of the whole migration time.
I could also use a Linux command-line utility if it is faster than native Python code.
You could use blocks larger than one line, and do binary I/O. Each might speed things up a bit (though on Linux binary I/O won't, as it's identical to text I/O).
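A minimal sketch of that block-wise, binary-mode conversion follows. The function name `convert_latin1_to_utf8` and the 1 MiB default for `BLOCKSIZE` are assumptions chosen for illustration; `tmpfile` and `tmpfile1` are the paths from the question.

```python
BLOCKSIZE = 1024 * 1024  # assumed starting value (1 MiB); tune for your system


def convert_latin1_to_utf8(src_path, dst_path, blocksize=BLOCKSIZE):
    """Re-encode a latin1 file as utf-8, reading fixed-size binary blocks."""
    with open(src_path, 'rb') as inf, open(dst_path, 'wb') as ouf:
        while True:
            data = inf.read(blocksize)
            if not data:
                break  # end of file
            # latin1 is a single-byte encoding, so a block boundary
            # can never split a character
            ouf.write(data.decode('latin1').encode('utf-8'))


# Usage with the paths from the question:
# convert_latin1_to_utf8(tmpfile, tmpfile1)
```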
The byte-by-byte parsing implied by line-by-line reading, line-end conversion (not on Linux ;-), and codecs.open-style encoding and decoding are probably part of what's slowing you down. Reading and converting in binary blocks is also portable (like your approach is), since control characters such as \n need no translation among these codecs anyway (in any OS).
This only works for input codecs that have no multibyte characters, but 'latin1' is one of those (it does not matter whether the output codec has such characters or not).
Try different block sizes to find the sweet spot performance-wise, depending on your disk, filesystem and available RAM.
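One way to look for that sweet spot is to time the same conversion at a few candidate block sizes. The sketch below is only illustrative: the helper names `time_conversion` and `find_best_blocksize` are made up for this example, and the candidate sizes are arbitrary. OS file caching can skew repeated runs over the same file, so treat the numbers as a rough guide.

```python
import time


def time_conversion(src_path, dst_path, blocksize):
    """Return the seconds taken to convert src_path (latin1) to dst_path (utf-8)."""
    start = time.perf_counter()
    with open(src_path, 'rb') as inf, open(dst_path, 'wb') as ouf:
        while True:
            data = inf.read(blocksize)
            if not data:
                break
            ouf.write(data.decode('latin1').encode('utf-8'))
    return time.perf_counter() - start


def find_best_blocksize(src_path, dst_path, candidates):
    """Time each candidate block size; return (fastest size, all timings)."""
    timings = {size: time_conversion(src_path, dst_path, size)
               for size in candidates}
    return min(timings, key=timings.get), timings


# Hypothetical usage on a representative sample file:
# best, timings = find_best_blocksize('sample.dat', 'sample.out',
#                                     [64 * 1024, 1024 * 1024, 16 * 1024 * 1024])
```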
Edit: changed code per @John's comment, and clarified a condition as per @gnibbler's.