I am trying to read text files with Ruby 1.9 and convert them into my own XML structure. I don’t have control over the source text files, so they could be in any encoding.
Here is what I do at the moment:
lines = File.readlines(input_file)
lines.each do |line|
  # do something
end
I have a problem with a file that contains the é character (\xE9). When I try to process the corresponding line, I get an "invalid byte sequence in UTF-8" exception when I call .match(...) on the string.
I tried to use the workaround described in "Fixing invalid UTF-8 in Ruby, revisited":
lines = File.readlines(input_file)
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
lines.each do |line|
  unless line.empty?
    valid_string = ic.iconv(line + ' ')[0..-2]
    # do something
  end
end
but this simply strips the é character from the line which is not what I want.
I think the real problem is that the file itself isn’t in UTF-8 but uses some ANSI encoding. Although the file is not UTF-8, the resulting line object reports UTF-8 when I call .encoding. My guess is that I need to read the file in a different way so that it works for both ANSI and UTF-8 files, but I am a Ruby beginner and I really don’t know where to start.
The character is part of the ISO-8859-1 and Win-1252 character sets, among others. The second is probably the most popular character set for Windows, and is your most likely source.
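If the file really is Windows-1252, one option is to tell Ruby so when reading it. The following is a minimal sketch under that assumption; the temporary file and its contents are made up for illustration, and the "Windows-1252:UTF-8" string asks Ruby to treat the bytes on disk as Windows-1252 and hand back UTF-8 strings:

```ruby
require 'tempfile'

# Hypothetical sample file: "caf" followed by the single byte 0xE9
# ("é" in Windows-1252) and a newline, written as raw bytes.
sample = Tempfile.new('cp1252_sample')
sample.binmode
sample.write([0x63, 0x61, 0x66, 0xE9, 0x0A].pack('C*'))
sample.close

# External encoding (on disk) : internal encoding (in memory).
lines = File.readlines(sample.path, encoding: 'Windows-1252:UTF-8')

lines.each do |line|
  # The line is now valid UTF-8, so regular expressions work on it.
  puts line if line.match(/é/)
end
```

This only helps when the source encoding is known in advance; reading a genuine UTF-8 file this way would garble its multi-byte characters.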
That’s my Ruby version running the following tests. Note that in the following samples the # encoding lines aren’t comments; they’re directives to Ruby on which character set to use when unencoded binary characters are found.
This shows the character in ISO-8859-1:
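As a rough illustration of that kind of check, here is a minimal sketch that assumes the troublesome byte is a lone \xE9. It builds the byte with no character set attached, shows that it is not valid UTF-8 on its own, and then reinterprets it as ISO-8859-1 before converting to UTF-8:

```ruby
# encoding: UTF-8

# One raw byte, tagged as binary (ASCII-8BIT), no character set implied.
raw = [0xE9].pack('C')

# Labelled as UTF-8, a lone 0xE9 is an incomplete sequence.
as_utf8 = raw.dup.force_encoding('UTF-8')
puts as_utf8.valid_encoding?        # false

# Labelled as ISO-8859-1, the same byte is the character é.
latin1 = raw.dup.force_encoding('ISO-8859-1')
puts latin1.valid_encoding?         # true

# Transcoding to UTF-8 turns the one byte into the two bytes C3 A9.
utf8 = latin1.encode('UTF-8')
puts utf8                           # é
puts utf8.bytes.to_a.inspect        # [195, 169]
```

force_encoding only changes the label on the bytes, while encode actually rewrites them, which is why both steps are needed.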
James Gray did a series of articles a couple years ago about dealing with this stuff. It’s good reading.
Now, back to trying to figure out what character set a character could be in: when you only have one character, it is difficult to determine which set it belongs to, because it could be in several sets at once. If you have more characters >= “\x80”, then you can run through the character sets iconv supports and try converting them. That’s messy, but I had to do that in Perl for some screen scraping about five years ago. An alternative is to use the Python chardet code.
James Gray’s articles have a link to an article recommending rchardet.
The above routines mention Mozilla’s Charset Detectors, which will give you more info on dealing with this.
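The "try each character set in turn" idea can be sketched roughly as follows. This is not iconv or rchardet: it uses Ruby's built-in String#encode instead, and both the to_utf8 helper and the CANDIDATES list are hypothetical names chosen for illustration. The candidate order is an assumption too; UTF-8 goes first because it is the only one strict enough to reject most foreign input:

```ruby
# Hypothetical candidate list, tried in order.
CANDIDATES = %w[UTF-8 Windows-1252 ISO-8859-1]

# Hypothetical helper: return a UTF-8 copy of the given bytes, using the
# first candidate encoding under which they are valid and convertible.
def to_utf8(bytes, candidates = CANDIDATES)
  candidates.each do |name|
    attempt = bytes.dup.force_encoding(name)
    next unless attempt.valid_encoding?
    begin
      return attempt.encode('UTF-8')
    rescue EncodingError
      # Some bytes have no mapping in this character set; try the next one.
      next
    end
  end
  # Nothing fit: substitute a replacement character for each unknown byte.
  bytes.dup.force_encoding('BINARY').encode('UTF-8', invalid: :replace, undef: :replace)
end

# "caf" plus the single byte 0xE9 is not valid UTF-8, so the helper
# falls through to Windows-1252.
puts to_utf8([0x63, 0x61, 0x66, 0xE9].pack('C*'))
```

Single-byte character sets accept nearly any byte sequence, so a successful conversion here is a guess rather than a detection; a statistical detector such as rchardet is the better fit when the candidates cannot be narrowed down beforehand.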