I am using URI.unescape to unescape text, but I run into a strange error:
# encoding: utf-8
require('uri')
URI.unescape("%C3%9Fą")
results in
C:/Ruby193/lib/ruby/1.9.1/uri/common.rb:331:in `gsub': incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)
from C:/Ruby193/lib/ruby/1.9.1/uri/common.rb:331:in `unescape'
from C:/Ruby193/lib/ruby/1.9.1/uri/common.rb:649:in `unescape'
from exe/fail.rb:3:in `<main>'
Why does this happen?
The implementation of URI.unescape is broken for non-ASCII inputs. The 1.9.3 version (uri/common.rb, line 331 in the trace above) is a gsub with a block.

The regex in use is /%[a-fA-F\d]{2}/, so it goes through the string looking for a percent sign followed by two hex digits. In the block, $& is the matched text ('%C3', for example) and $&[1,2] is the matched text without the leading percent sign ('C3'). Then String#hex converts that hexadecimal number to a Fixnum (195), which is wrapped in an Array ([195]) so that Array#pack can do the byte mangling for us. The problem is that pack gives us a single binary byte, a one-byte string whose encoding is ASCII-8BIT.

The ASCII-8BIT encoding is also known as "binary" (i.e. plain bytes with no particular encoding). The block returns that byte, String#gsub tries to insert it into the UTF-8 encoded copy of str that it is working on, and you get your error, because you can't (in general) just stuff binary bytes into a UTF-8 string. You can often get away with it, though.
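To make the failure concrete, here is a standalone sketch that reproduces it without going through the library. The helper name naive_unescape is hypothetical, and the 'C' pack directive and the trailing force_encoding are my reconstruction of the 1.9.3 logic from the description above, so treat it as an approximation rather than the library source. "\u0105" is ą written as an escape so the snippet does not depend on the source file's encoding.

```ruby
# Hypothetical stand-in for the 1.9.3 logic described above
# (a reconstruction, not copied from uri/common.rb).
def naive_unescape(str)
  str.gsub(/%[a-fA-F\d]{2}/) { [$&[1, 2].hex].pack('C') }
     .force_encoding(str.encoding)
end

# pack('C') yields a one-byte string tagged as binary (ASCII-8BIT).
byte = ['C3'.hex].pack('C')
puts byte.encoding == Encoding::BINARY        # true

# ASCII-only input: the binary bytes can be spliced in without complaint.
puts naive_unescape("%C3%9F") == "\u00DF"     # true ("\u00DF" is ß)

# Input that also contains a raw non-ASCII character ("\u0105" is ą):
# binary bytes and UTF-8 characters cannot be joined, so gsub raises.
begin
  naive_unescape("%C3%9F\u0105")
rescue Encoding::CompatibilityError => e
  puts e.class                                # Encoding::CompatibilityError
end
```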
Things start falling apart once you start mixing non-ASCII data into your URL encoded string.
One simple fix is to switch the string to binary before trying to decode it.
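A minimal sketch of that fix, assuming you are free to wrap the decoding in your own helper (unescape_via_binary is a hypothetical name, not part of the URI module). It relabels a copy of the input as binary so that every piece gsub handles has the same encoding, then restores the original encoding on the result.

```ruby
# Hypothetical helper: decode percent-escapes on a binary copy of the
# string, then relabel the result with the caller's original encoding.
def unescape_via_binary(str)
  original = str.encoding
  str.dup.force_encoding(Encoding::BINARY)
     .gsub(/%[a-fA-F\d]{2}/) { [$&[1, 2].hex].pack('C') }
     .force_encoding(original)
end

decoded = unescape_via_binary("%C3%9F\u0105")   # "\u0105" is ą
puts decoded == "\u00DF\u0105"                  # true ("\u00DF" is ß)
puts decoded.valid_encoding?                    # true
```

The dup matters: force_encoding modifies its receiver, so without it the caller's string would be relabelled as binary too.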
Another option is to push the force_encoding into the block. I'm not sure why the gsub fails in some cases but succeeds in others.
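The second option might look like the sketch below (unescape_in_block is again a hypothetical name). Each byte is relabelled with the input's encoding before the block returns it, so gsub only ever joins strings that carry the same encoding tag. A single byte such as "\xC3" is not valid UTF-8 on its own, so this relies on the assembled result being valid once all the bytes are in place.

```ruby
# Hypothetical helper: tag every decoded byte with the input's encoding
# inside the block, so gsub never mixes binary and UTF-8 pieces.
def unescape_in_block(str)
  encoding = str.encoding
  str.gsub(/%[a-fA-F\d]{2}/) do
    [$&[1, 2].hex].pack('C').force_encoding(encoding)
  end
end

decoded = unescape_in_block("%C3%9F\u0105")   # "\u0105" is ą
puts decoded == "\u00DF\u0105"                # true ("\u00DF" is ß)
```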