I’m running Ruby 1.9.2 and trying to repair some broken UTF-8 text input. The text is literally "\\354\\203\\201\\355\\221\\234\\353\\252\\205" (backslashes followed by digits), and I want to turn it into the correct Korean, "상표명".
However, after searching for a while and trying a few methods, I still get gibberish.
It’s confusing, because the escaped-character example (the second puts below) works fine:
# encoding: utf-8
puts "상표명" # Target string
# Output: "상표명"
puts "\354\203\201\355\221\234\353\252\205" # Works with escaped characters like this
# Output: "상표명"
# Real input is a string
input = "\\354\\203\\201\\355\\221\\234\\353\\252\\205"
# After some manipulation got it into an array of numbers
puts [354, 203, 201, 355, 221, 234, 353, 252, 205].pack('U*').force_encoding('UTF-8')
# Output: ŢËÉţÝêšüÍ (gibberish)
I’m sure this must have been answered somewhere but I haven’t managed to find it.
This is what you want to do to get your UTF-8 Korean text:
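A minimal sketch of that conversion, assuming the input is the literal backslash-and-digits string from the question. The variable names (octal_digits, bytes, fixed) are illustrative only:

```ruby
# encoding: utf-8

# The literal text from the question: backslashes followed by octal digits.
input = "\\354\\203\\201\\355\\221\\234\\353\\252\\205"

# Pull out each run of digits as a string: ["354", "203", ...]
octal_digits = input.scan(/\d+/)

# Interpret each run as an octal number to get the byte values.
bytes = octal_digits.map { |n| n.to_i(8) }

# Pack the integers as raw bytes, then label the result as UTF-8.
fixed = bytes.pack('C*').force_encoding('utf-8')

puts fixed
```

The same steps can be chained into a single expression if you prefer; the intermediate variables are there only to make each stage visible.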
And this is how it works:
1. scan to pull out the individual numbers.
2. map with to_i(8) to convert the octal values (as noted by Henning Makholm) to integers.
3. pack('C*') to get a byte string. This string will have the BINARY encoding (AKA ASCII-8BIT).
4. force_encoding('utf-8') to mark those bytes as UTF-8.

The main thing that you were missing was your pack format: 'U' means “UTF-8 character” and expects an array of Unicode codepoints, each represented by a single integer, whereas 'C' expects an array of bytes, and that’s what we had.
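To illustrate the difference between the two pack formats, here is a small sketch that packs the same three integers both ways. It assumes only the first character of the target string, and the variable names are made up for the example:

```ruby
# encoding: utf-8

# Octal literals for the three UTF-8 bytes that encode the first character.
bytes = [0354, 0203, 0201]

# 'U*' treats every integer as a Unicode codepoint, so we get three characters.
as_codepoints = bytes.pack('U*')

# 'C*' treats every integer as one byte; together they form a single character.
as_bytes = bytes.pack('C*').force_encoding('utf-8')

puts as_codepoints.length  # 3
puts as_bytes.length       # 1
puts as_bytes
```

Note that the attempt in the question also passed 354, 203 and so on as decimal numbers, so the values were wrong as well as the format.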