I have a UTF-16 CSV file which I have to read. Python csv module does not seem to support UTF-16.
I am using python 2.7.2. CSV files I need to parse are huge size running into several GBs of data.
Answers for John Machin questions below
print repr(open('test.csv', 'rb').read(100))
Output with test.csv having just abc as content
'\xff\xfea\x00b\x00c\x00'
I think csv file got created on windows machine in USA. I am using Mac OSX Lion.
If I use code provided by phihag and test.csv containing one record.
sample test.csv content used. Below is print repr(open(‘test.csv’, ‘rb’).read(1000)) output
'\xff\xfe1\x00,\x002\x00,\x00G\x00,\x00S\x00,\x00H\x00 \x00f\x00\xfc\x00r\x00 \x00e\x00 \x00\x96\x00 \x00m\x00 \x00\x85\x00,\x00,\x00I\x00\r\x00\n\x00'
Code by phihag
import codecs
import csv
with open('test.csv','rb') as f:
sr = codecs.StreamRecoder(f,codecs.getencoder('utf-8'),codecs.getdecoder('utf-8'),codecs.getreader('utf-16'),codecs.getwriter('utf-16'))
for row in csv.reader(sr):
print row
Output of the above code
['1', '2', 'G', 'S', 'H f\xc3\xbcr e \xc2\x96 m \xc2\x85']
['', '', 'I']
expected output is
['1', '2', 'G', 'S', 'H f\xc3\xbcr e \xc2\x96 m \xc2\x85','','I']
At the moment, the csv module does not support UTF-16.
In Python 3.x, csv expects a text-mode file and you can simply use the encoding parameter of
opento force another encoding:In Python 2.x, you can recode the input:
openandcodecs.openrequire the file to start with a BOM. If it doesn’t (or you’re on Python 2.x), you can still convert it in memory, like this: