Here is the complete code for taking a tab-delimited text file and creating a new file that keeps only the first two values of each line:
fin = open("in.txt", 'r')
fout = open("out.txt", 'w')
for line in fin:
    mrList = line.split('\t')
    fout.write(mrList[0] + "\t" + mrList[1])
    fout.write('\n')
fin.close()
fout.close()
When this goes in:
Hello world<tab>how are you?<tab>Groovy
Like pie?<tab>I love it<tab>omnomnom
Go pikachu!<tab>Use pound!<tab>She like
This comes out:
Hello world<tab>how are you?䰀椀欀攀 瀀椀攀㼀ऀ䤀 氀漀瘀攀 椀琀ഀ
Go pikachu!<tab>Use pound!
I suspect that ‘\n’ is not quite a newline, but googling it insists “it’s definitely \n 0_0”.
UPDATE:
Since the answer below (thanks!), I discovered that on a Linux command line:
file peskyInputFile.txt
tells you the encoding, and that
iconv -c -f utf-16 -t utf-8 peskyInputFile.txt -o outputFile.txt
will convert a UTF-16 file to UTF-8, which avoids the hassle if you don’t need to deal with UTF-16.
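If iconv isn't available, roughly the same conversion can be done from Python. This is only a minimal sketch: it assumes the input starts with a byte-order mark, the name `convert_utf16_to_utf8` is made up for illustration, and unlike `iconv -c` it raises an error on invalid input instead of silently dropping characters.

```python
import io

def convert_utf16_to_utf8(src_path, dst_path):
    """Re-encode a UTF-16 text file as UTF-8 (hypothetical helper)."""
    # The "utf-16" codec uses the byte-order mark to pick the endianness
    # and strips it from the decoded text.
    # newline="" disables newline translation, so line endings are copied as-is.
    with io.open(src_path, "r", encoding="utf-16", newline="") as fin, \
         io.open(dst_path, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(line)
```

Calling `convert_utf16_to_utf8("peskyInputFile.txt", "outputFile.txt")` would mirror the iconv command above.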
The file is encoded in UTF-16, and you are processing it as if it were ASCII. In UTF-16 a newline takes two bytes, but only one of them is consumed when the newline is handled, so the UTF-16 is off by one byte until the next newline. See “Python thinks a 3000-line text file is one line long?” for a solution and explanation.
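Applied to the script in the question, the fix is to decode the file while reading it rather than handling raw bytes. A rough sketch, assuming the input is UTF-16 with a byte-order mark and that UTF-8 output is acceptable (`first_two_columns` and its parameters are names I made up):

```python
import io

def first_two_columns(src_path, dst_path, src_encoding="utf-16"):
    """Copy the first two tab-separated fields of every line (hypothetical helper)."""
    # io.open decodes on read, so the loop sees real characters, not bytes.
    with io.open(src_path, "r", encoding=src_encoding) as fin, \
         io.open(dst_path, "w", encoding="utf-8") as fout:
        for line in fin:
            # In decoded text a newline is a single character, so stripping is safe.
            fields = line.rstrip(u"\r\n").split(u"\t")
            # Slicing avoids an IndexError on lines with fewer than two fields.
            fout.write(u"\t".join(fields[:2]) + u"\n")
```

With the question's file names this would be `first_two_columns("in.txt", "out.txt")`. Pass a different `src_encoding` if `file` reports something other than UTF-16.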
This is what you’re doing: