I am having trouble handling text files of tabulated data generated on a Windows machine.
I’m working in Ruby 1.8. The following gives an error (“\000” (Iconv::InvalidCharacter)) when processing the SECOND line from the file. The first line is converted properly.
require 'iconv'
conv = Iconv.new("UTF-8//IGNORE","UTF-16")
infile = File.open(tabfile, "r")
while (line = infile.gets)
  line = conv.iconv(line.strip) # FAILS HERE
  puts line
  # DO MORE STUFF HERE
end
The strange thing is that it reads and converts the first line in the file with no problem.
I have the //IGNORE flag in the Iconv constructor — I thought this was supposed to suppress this kind of error.
I’ve been going in circles for a while. Any advice would be highly appreciated.
Thanks!
EDIT:
hobbs's solution fixes this. Thank you.
Simply change the code to:
require 'iconv'
conv = Iconv.new("UTF-8//IGNORE","UTF-16")
infile = File.open(tabfile, "r")
while (line = infile.gets("\x0a\x00"))
  line = conv.iconv(line.strip) # NO LONGER FAILS HERE
  # DOES MORE STUFF HERE
end
Now I’ll just need to find a way to automatically determine which gets separator to use.
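One possible approach, sketched here under the assumption that the file begins with a byte-order mark (BOM), is to peek at the first few bytes and pick the separator from them. The helper name guess_separator is hypothetical, and files without a BOM (or in other encodings such as UTF-32) are not handled; it simply falls back to a plain "\x0a".

```ruby
# Hypothetical helper: choose a gets separator from the first bytes of a file.
# Assumption: UTF-16 files start with a BOM; anything else is treated as
# a single-byte-newline encoding such as ASCII or UTF-8.
def guess_separator(head)
  b = head.unpack("C3") # first bytes as integers; missing bytes come back as nil
  if b[0] == 0xFF && b[1] == 0xFE
    "\x0a\x00"          # UTF-16LE newline
  elsif b[0] == 0xFE && b[1] == 0xFF
    "\x00\x0a"          # UTF-16BE newline
  else
    "\x0a"              # no UTF-16 BOM found
  end
end

# In-memory demonstration: BOM followed by "a" in UTF-16LE.
head = [0xFF, 0xFE, 0x61, 0x00].pack("C*")
p guess_separator(head).unpack("C*") # => [10, 0]
```

With a real file this would be something like reading the first three bytes from infile, calling guess_separator on them, rewinding, and passing the result to gets. The same BOM check also tells you whether the source encoding is UTF-16LE or UTF-16BE.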
The error message is pretty vague, but I think it's unhappy about finding an odd number of bytes on a line, since every character in UTF-16 is two (or occasionally four) bytes. And I think the reason for that is your use of gets: the lines in your file are separated by a UTF-16le newline, which is 0x0a 0x00, but gets is splitting on (and strip is removing) 0x0a only.
To illustrate: suppose the file contains two lines, "ab" and "cd", each ending in a newline, encoded in UTF-16le. That's 0x61 0x00 0x62 0x00 0x0a 0x00 0x63 0x00 0x64 0x00 0x0a 0x00.
gets reads up to the first 0x0a, which strip removes, so the first line read is 0x61 0x00 0x62 0x00, which iconv happily accepts and encodes to UTF-8 as 0x61 0x62: "ab". gets then reads up to the next 0x0a, which strip again removes, so the second time line gets 0x00 0x63 0x00 0x64 0x00 and now everything is screwed up. We're out of sync by one byte and there's an odd number of bytes to convert, and iconv blows up because that's incompatible with what you asked it to do.
Absent an actual working file encoding/decoding layer, I think what you want is to change the gets separator from "\n" ("\x0a") to "\x0a\x00", abandon all use of strip since it's not encoding-clean, and use print instead of puts so that you don't add extra line-ends (since you'll be converting the ones you've already got).
If you're working with Windows files, a Windows CRLF in UTF-16le is "\x0d\x00\x0a\x00".