I’m importing a CSV file into Ruby (1.8.7). File.open('path/to/file.csv').read returns this in the console:
Stefan,Engstr\232m
The encoding is identified as iso-8859-2 by UniversalDetector (chardet gem).
UniversalDetector::chardet("Stefan,Engstr\232m")
=> {"confidence"=>0.626936305574385, "encoding"=>"ISO-8859-2"}
Trying to convert the string yields the following:
Iconv.conv("UTF-8", "ISO-8859-2", "Stefan,Engstr\232m")
=> "Stefan,Engstrm"
whereas I would expect:
=> "Stefan,Engström"
- Could the string really be in some other encoding?
- I haven’t seen the \232 syntax before; usually when strings are strangely encoded, some weird character shows up instead, e.g. � or some Chinese characters.
Let me know if I should provide more information or elaborate on something.
The encoding is probably “Macintosh Roman”; a couple of other options would be “Mac Central European” and “Mac Icelandic”. The \nnn notation uses octal, so \232 is 154 in decimal (0x9A), and character 154 is the lowercase o-umlaut (“ö”) that you’re expecting in all three of those encodings; I don’t see ö at position 154 in any of the Windows code pages or ISO 8859 character sets. I’d guess that Mac Roman is more common than the Icelandic or Central European encodings.
Try using 'MacRoman' as your source encoding with Iconv:
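A minimal sketch of that conversion, assuming the file really is Mac Roman. The helper name mac_roman_to_utf8 is made up for illustration. The Iconv branch is what would run on 1.8.7, and the encoding name it accepts depends on the underlying iconv library (some builds want 'MACINTOSH' rather than 'MacRoman'). The String#encode branch is only there so the same snippet also runs on newer Rubies, where Iconv is no longer bundled.

```ruby
# "\232" is an octal escape: 2*64 + 3*8 + 2 = 154 (0x9A).
raw  = "Stefan,Engstr\232m"
byte = raw.unpack('C*')[13]   # the byte right after "Stefan,Engstr"

# Hypothetical helper: treat the bytes as Mac Roman and return UTF-8.
def mac_roman_to_utf8(bytes)
  if bytes.respond_to?(:encode)
    # Ruby 1.9+: relabel the raw bytes as Mac Roman, then transcode.
    bytes.dup.force_encoding('macRoman').encode('UTF-8')
  else
    # Ruby 1.8.x: Iconv ships with the standard library.
    # Assumption: the local iconv knows the name 'MacRoman'.
    require 'iconv'
    Iconv.conv('UTF-8', 'MacRoman', bytes)
  end
end

utf8 = mac_roman_to_utf8(raw)
puts byte   # 154
puts utf8   # Stefan,Engström
```

If the result still looks wrong, swapping in the Mac Central European or Mac Icelandic encoding name is the next thing to try.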