I’m trying to retrieve a Web page and apply a simple regular expression to it.
Some Web pages contain non-UTF-8 characters, even though UTF-8 is claimed in Content-Type (example). In these cases I get:
ArgumentError (invalid byte sequence in UTF-8)
I’ve tried to use the following methods for sanitizing bad characters, but none of them helped to solve the issue:
content = Iconv.conv("UTF-8//IGNORE", "UTF-8", content)
content.encode!("UTF-8", :illegal => :replace, :undef => :replace, :replace => "?")
Here’s the complete code:
response = Net::HTTP.get_response(url)
@encoding = detect_encoding(response) # Detects encoding using Content-Type or meta charset HTML tag
if @encoding
  @content = response.body.force_encoding(@encoding)
  @content = Iconv.conv(@encoding + '//IGNORE', @encoding, @content)
else
  @content = response.body
end
@content.gsub!(/.../, "") # bang
Is there a way to deal with this issue? Basically, what I need is to set the base URL meta tag and inject some JavaScript into the retrieved Web page.
Thanks!
I had a similar problem importing emails with different encodings. I ended up with this:
First, it tries to convert from *some_format* to UTF-8. If no encoding is given, or Iconv fails for some reason, it applies a forced conversion (ignore errors, transliterate characters, and strip unrecognized characters).
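Here is a rough sketch of that two-step idea, not the exact snippet described above. It assumes Ruby 2.1 or later, where Iconv is no longer in the standard library (it was removed in Ruby 2.0), so String#encode and String#scrub are swapped in for the Iconv calls. The helper name to_clean_utf8 is made up for illustration. Note that this version only strips bad bytes; it does not transliterate.

```ruby
# Hypothetical helper: to_clean_utf8 is an illustrative name, not part of any library.
# Step 1: try a strict conversion from the declared encoding to UTF-8.
# Step 2: if there is no encoding, or step 1 fails, treat the bytes as UTF-8
#         and strip every byte sequence that is not valid.
def to_clean_utf8(raw, source_encoding = nil)
  if source_encoding
    begin
      candidate = raw.dup.force_encoding(source_encoding)
      # Check validity explicitly: encode can be a no-op when the source and
      # target encodings are the same, so it would not catch bad bytes itself.
      return candidate.encode(Encoding::UTF_8) if candidate.valid_encoding?
    rescue ArgumentError, EncodingError
      # Unknown encoding name or failed conversion: fall through to the fallback.
    end
  end

  # Fallback: keep the valid UTF-8 parts and drop everything else.
  raw.dup.force_encoding(Encoding::UTF_8).scrub("")
end
```

With the code from the question, this would be used as @content = to_clean_utf8(response.body, @encoding), after which gsub! should no longer raise ArgumentError because the string is valid UTF-8. Pass "?" to scrub instead of "" if you would rather see a placeholder than have the bad bytes silently removed.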
Let me know if it works for you 😉
A.