In IRB, I’m trying the following:
1.9.3p194 :001 > foo = "\xBF".encode("utf-8", :invalid => :replace, :undef => :replace)
=> "\xBF"
1.9.3p194 :002 > foo.match /foo/
ArgumentError: invalid byte sequence in UTF-8
from (irb):2:in `match'
Any ideas what’s going wrong?
I’d guess that "\xBF" already thinks it is encoded in UTF-8, so when you call encode, it thinks you’re trying to encode a UTF-8 string in UTF-8 and does nothing. \xBF isn’t valid UTF-8, so this is, of course, nonsense.
But encode also has a three-argument form that takes the destination encoding, the source encoding, and then the options. You can force the issue by telling encode to ignore what the string thinks its encoding is and treat it as binary data. Call the "\xBF" string that thinks it is UTF-8 s, and pass "binary" as the source encoding when you encode s to UTF-8.
You could also use force_encoding on s to force it to be binary and then use the two-argument encode.
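Both approaches can be sketched roughly as follows. This is a minimal illustration, not a definitive recipe: the variable names s, fixed and also_fixed are made up for the example, and s is built with pack and force_encoding so that it is tagged as UTF-8 regardless of the source file's encoding. Whether the same-encoding call in the question is a no-op may depend on the Ruby version, so the sketch does not rely on that behaviour.

```ruby
# Build the string from the question: the single byte 0xBF, tagged as UTF-8.
# (pack returns a fresh binary string, which we then relabel as UTF-8.)
s = [0xBF].pack("C").force_encoding("UTF-8")
puts s.encoding          # the string claims to be UTF-8 ...
puts s.valid_encoding?   # ... but a lone 0xBF is not valid UTF-8

# Option 1: three-argument encode. The second argument tells encode to
# treat the bytes as binary instead of trusting the string's own encoding.
fixed = s.encode("UTF-8", "binary", :invalid => :replace, :undef => :replace)

# Option 2: relabel a copy as binary first, then use two-argument encode.
# dup is used here only so that s itself keeps its UTF-8 tag.
also_fixed = s.dup.force_encoding("binary")
              .encode("UTF-8", :invalid => :replace, :undef => :replace)

# The offending byte has been replaced (by default with the Unicode
# replacement character), so the result is valid UTF-8 and safe to match.
puts fixed.valid_encoding?
p fixed.match(/foo/)
```

In both cases the byte has no UTF-8 equivalent when read as binary, so :undef => :replace is what swaps it out, and the match call no longer raises ArgumentError.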