I have to convert a string containing accented Latin characters like éáéíóúÀÉÍÓÚ into similar characters without accents or weird symbols:
é -> e
è -> e
Ä -> A
I have a file named “test.rb”:
require 'iconv'
puts Iconv.iconv("ASCII//translit", "utf-8", 'è').join
When I paste those lines into irb it works, returning “e” as expected.
Running:
$ ruby test.rb
I get “?” as output.
I’m using irb 0.9.5(05/04/13) and Ruby 1.8.7 (2011-06-30 patchlevel 352) [i386-linux].
Ruby 1.8.7 was not multibyte character savvy like 1.9+ is. In general, it treats a string as a series of bytes, rather than characters. If you need better handling of such characters, consider upgrading to 1.9+.
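To make the bytes-versus-characters distinction concrete, here is a minimal sketch, assuming Ruby 1.9 or later and a UTF-8 string. The variable names are only illustrative. It shows that è is one character but two bytes; Ruby 1.8 only ever sees the bytes.

```ruby
# "\u00E8" is è (U+00E8). In UTF-8 it is encoded as the two bytes 0xC3 0xA8.
s = "\u00E8"

char_count  = s.chars.to_a.length   # characters, as Ruby 1.9+ counts them
byte_values = s.bytes.to_a          # the raw bytes, which is all Ruby 1.8 works with
byte_count  = byte_values.length

puts "characters: #{char_count}"
puts "bytes: #{byte_count}"
puts byte_values.map { |b| format('0x%02X', b) }.join(' ')
```

Splitting such a string byte by byte and handing each piece to a converter separately gives it incomplete UTF-8 sequences, which is why a per-byte split cannot be transliterated.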
James Gray has a series of articles about dealing with multibyte characters in Ruby 1.8. I highly recommend taking the time to read through them. It’s a complex subject so you’ll want to read the entire series he wrote a couple times.
Also, 1.8 encoding support needs the $KCODE flag set, so you’ll need to add that to code running in 1.8.
Here is a bit of sample code:
Using ruby 1.8.7 (2011-06-30 patchlevel 352) [x86_64-darwin10.7.0] and running it in IRB, I get:
At line 9 in the output I told Ruby to split the line into its concept of characters, which in 1.8.7 was bytes. The resulting ‘?’ characters mean it didn’t know what to do with the output. At line 10 I told it to split, which resulted in an array of bytes, which join then reassembled into the normal string, allowing the multibyte characters to be translated normally.
Running the same code using Ruby 1.9.2 shows better, and more expected and desirable, behavior:
Ruby maintained the multibyte-ness of the characters through the split('').
Notice that in both cases, Iconv.iconv did the right thing: it created characters that were visually similar to the input characters. While the leading apostrophe looks out of place, it’s there as a reminder the characters were accented originally.
For more information, see the links on the right to related questions or try this SO search for [ruby] [iconv].
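If upgrading is an option, note that Iconv was removed from the standard library in Ruby 2.0, so the test.rb above will not load there without the separate iconv gem. As a rough sketch of a different technique (not what Iconv does internally), and assuming Ruby 2.2 or later for String#unicode_normalize, the accents can be stripped by decomposing each character and deleting the combining marks. The method name strip_accents is hypothetical.

```ruby
# Hypothetical helper: remove accents by Unicode decomposition.
# Assumes Ruby 2.2+ (String#unicode_normalize) and UTF-8 input.
def strip_accents(text)
  # NFD splits "é" into "e" followed by a combining acute accent (U+0301);
  # \p{Mn} matches those nonspacing combining marks so they can be dropped.
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

puts strip_accents("\u00E9")  # é
puts strip_accents("\u00E8")  # è
puts strip_accents("\u00C4")  # Ä
```

Unlike the Iconv transliteration shown earlier, this leaves no apostrophe behind, and it leaves characters that have no decomposition (ø or ß, for example) unchanged, so it is not a drop-in replacement for every input.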