I’m trying to retrieve a Web page, and apply a simple regular expression on

Question

Asked: May 23, 2026 (2026-05-23T04:17:10+00:00)

I’m trying to retrieve a Web page, and apply a simple regular expression on it.
Some Web pages contain non-UTF-8 characters, even though UTF-8 is claimed in Content-Type (example). In these cases I get:

ArgumentError (invalid byte sequence in UTF-8)
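
For reference, the error can be reproduced without any network access. This is a minimal sketch, assuming Ruby 1.9 or later; the sample bytes and the helper name regex_fails_on? are made up for illustration and are not part of the original code:

```ruby
# Hypothetical helper: returns true when applying a regex to `str` raises
# ArgumentError, which is what happens when the string contains bytes that
# are invalid in the encoding it is tagged with.
def regex_fails_on?(str)
  str.gsub(/b/, "")
  false
rescue ArgumentError
  true
end

# \xFF can never appear in well-formed UTF-8, so this string is "broken".
bad = "abc\xFFdef".dup.force_encoding("UTF-8")

puts bad.valid_encoding?   # false
puts regex_fails_on?(bad)  # true
```

String#valid_encoding? is a cheap way to check a response body before running any regular expression on it.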

I’ve tried to use the following methods for sanitizing bad characters, but none of them helped to solve the issue:

content = Iconv.conv("UTF-8//IGNORE", "UTF-8", content)
content.encode!("UTF-8", :illegal => :replace, :undef => :replace, :replace => "?")

Here’s the complete code:

response = Net::HTTP.get_response(url)
@encoding = detect_encoding(response) # Detects encoding using Content-Type or meta charset HTML tag
if @encoding
  @content = response.body.force_encoding(@encoding)
  @content = Iconv.conv(@encoding + '//IGNORE', @encoding, @content)
else
  @content = response.body
end

@content.gsub!(/.../, "") # bang

Is there a way to deal with this issue? Basically, what I need is to set the base URL meta tag and inject some JavaScript into the retrieved Web page.

Thanks!

1 Answer

Editorial Team · Answer 1 · 2026-05-23T04:17:11+00:00

I had a similar problem importing emails with different encodings. I ended up with this:

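# Meant to be defined on String. Note: is_utf8? is not a core String method
# and is assumed to be defined elsewhere; Iconv left the standard library in
# Ruby 2.0, so newer versions need the iconv gem.
def enforce_utf8(from = nil)
  # First attempt: convert from the given encoding to UTF-8.
  is_utf8? ? self : Iconv.iconv('utf8', from, self).first
rescue
  # Fallback: ignore errors, transliterate, and keep only code points below 127.
  converter = Iconv.new('UTF-8//IGNORE//TRANSLIT', 'ASCII//IGNORE//TRANSLIT')
  converter.iconv(self).unpack('U*').select { |cp| cp < 127 }.pack('U*')
end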

At first, it tries to convert from *some_format* to UTF-8. If no encoding is given or Iconv fails for some reason, it applies a strong conversion instead (ignore errors, transliterate characters, and strip unrecognized characters).
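
On Ruby versions that no longer ship Iconv, the same two-step idea can be approximated with the built-in String#encode and String#scrub. This is only a sketch under assumptions: it needs Ruby 2.1 or later for scrub, the helper name to_valid_utf8 is hypothetical, and unlike the method above it replaces invalid bytes with "?" rather than stripping everything that is not ASCII:

```ruby
# Hypothetical helper (not from the answer above): returns a copy of `str`
# that is guaranteed to be valid UTF-8.
# `from` is the encoding the bytes are believed to be in, or nil if unknown.
def to_valid_utf8(str, from = nil)
  text = str.dup
  begin
    # Step 1: transcode from the claimed encoding, replacing bad bytes.
    if from
      text = text.force_encoding(from)
                 .encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
    end
  rescue ArgumentError, EncodingError
    # Unknown encoding name or no converter available: fall through to step 2.
  end
  # Step 2: treat whatever we have as UTF-8 and replace any invalid sequences.
  text.force_encoding("UTF-8").scrub("?")
end

latin1 = "caf\xE9".b       # "café" encoded as ISO-8859-1
broken = "abc\xFFdef".b    # claims to be UTF-8 but contains an invalid byte

puts to_valid_utf8(latin1, "ISO-8859-1")  # café
puts to_valid_utf8(broken, "UTF-8")       # abc?def
puts to_valid_utf8(broken)                # abc?def
```

The result always passes valid_encoding?, so a later gsub! with a regular expression should no longer raise ArgumentError.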

Let me know if it works for you 😉

A.
