I get sources from the web and sometimes the encoding of the material is


Asked: June 11, 2026


I get sources from the web, and sometimes the encoding of the material is not 100% valid UTF-8. I use iconv to silently ignore the invalid byte sequences and get a cleaned string.

@iconv = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = @iconv.iconv(untrusted_string)

However, now that iconv has been deprecated, I see its deprecation warning a lot:

iconv will be deprecated in the future, use String#encode

I tried converting it using String#encode's :invalid and :replace options, but it does not seem to work (i.e. the invalid byte sequence is not removed). What is the correct way to use String#encode for this?


1 Answer

Editorial Team · Answer 1 · 2026-06-11T20:47:53+00:00

The question that Martijn linked to has what seem to be the two best ways to do this, but Martijn made an understandable but incorrect change when copying the second approach into his answer here. Calling .encode('UTF-8', <options>).encode('UTF-8') doesn't work. As indicated in the original answer to the other question, the key is to encode to a different encoding, then back to UTF-8. If your original string is already flagged as UTF-8 in Ruby's internals, then Ruby will ignore any call to encode it as UTF-8.

In the following examples I'm going to use "a#{0xFF.chr}b".force_encoding('UTF-8') to produce a string that Ruby believes is UTF-8 but which contains invalid UTF-8 bytes.

1.9.3p194 :019 > "a#{0xFF.chr}b".force_encoding('UTF-8')
 => "a\xFFb" 
1.9.3p194 :020 > "#{0xFF.chr}".force_encoding('UTF-8').encoding
 => #<Encoding:UTF-8>
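If you want to check whether a string is in this state before cleaning it, String#valid_encoding? reports whether the bytes actually match the encoding tag. A minimal sketch; the variable names are just for illustration:

```ruby
# Build a string that is tagged UTF-8 but contains the invalid byte 0xFF.
broken = [0x61, 0xFF, 0x62].pack('C*').force_encoding('UTF-8')
intact = 'ab'.encode('UTF-8')

puts broken.encoding         # UTF-8 (the tag only, not a validity check)
puts broken.valid_encoding?  # false
puts intact.valid_encoding?  # true
```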

Note how encoding to UTF-8 does nothing:

1.9.3p194 :016 > "a#{0xFF.chr}b".force_encoding('UTF-8').encode('UTF-8', :invalid => :replace, :replace => '').encode('UTF-8')
 => "a\xFFb"

But encoding to something else (UTF-16) and then back to UTF-8 cleans up the string:

1.9.3p194 :017 > "a#{0xFF.chr}b".force_encoding('UTF-8').encode('UTF-16', :invalid => :replace, :replace => '').encode('UTF-8')
 => "ab"
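Wrapped up as a drop-in replacement for the Iconv call in the question, the round trip might look like the sketch below. The helper name strip_invalid_utf8 is made up for this example, and it assumes the input is meant to be UTF-8. On Ruby 2.1 and later, String#scrub does the same job directly, so the sketch prefers it when it is available and falls back to the UTF-16 round trip otherwise.

```ruby
# Hypothetical helper; the name is illustrative and not from the original answer.
# Assumes the input is supposed to be UTF-8 and only needs invalid bytes dropped.
def strip_invalid_utf8(untrusted_string)
  # Work on a copy so the caller's string is not retagged.
  text = untrusted_string.dup.force_encoding('UTF-8')
  if text.respond_to?(:scrub)
    # Ruby 2.1+: replace each invalid byte sequence with an empty string.
    text.scrub('')
  else
    # Older Rubies: round-trip through UTF-16 as in the irb session above.
    text.encode('UTF-16', :invalid => :replace, :replace => '').encode('UTF-8')
  end
end

dirty = [0x61, 0xFF, 0x62].pack('C*').force_encoding('UTF-8')
clean = strip_invalid_utf8(dirty)
puts clean.inspect
puts clean.valid_encoding?
```

Valid multi-byte characters pass through unchanged; only byte sequences that are not valid UTF-8 are removed.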
