I’m retrieving an HTML document that is parsed with Nokogiri. The HTML is using


Asked: May 28, 2026 (2026-05-28T01:34:46+00:00)

I’m retrieving an HTML document that is parsed with Nokogiri. The HTML is using charset ISO-8859-1. The problem is there are some Unicode chars in the document which are converted to Unicode code points instead of their respective character.

For example, this is some text in the HTML as received (in ISO-8859-1):

\x95\x95 JOHNNY VENETTI \x95\x95

And when attempting to work with this text, it gets converted to this:

\u0095\u0095 JOHNNY VENETTI \u0095\u0095

So my question is, how can I ensure those characters are represented as their appropriate character instead of the code point? I’ve tried doing a gsub on the text, but that seems wrong for this. Also, I do not have control over the encoding of the HTML document.


1 Answer

Editorial Team · Answer 1 · 2026-05-28T01:34:47+00:00

First, note that this string is not actually ISO-8859-1 encoded: the file utility reports "Non-ISO extended-ASCII text", and the code page confirms it. That may well be your problem. If so, the right encoding (probably Windows-1252 in this case) should be specified in the HTML document.
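
To see why the declared charset matters, here is a minimal sketch using only core Ruby (2.0 or later, for String#b). The helper name decode_as is made up for this illustration. It decodes the same raw byte under both charsets: ISO-8859-1 maps byte 0x95 to the control character U+0095, whereas Windows-1252 maps it to the bullet U+2022.

```ruby
# decode_as is a hypothetical helper for this illustration, not a library method.
# It tags a copy of the raw bytes with the given charset and transcodes to UTF-8.
def decode_as(bytes, charset)
  bytes.dup.force_encoding(charset).encode("UTF-8")
end

raw = "\x95".b  # String#b returns a binary (ASCII-8BIT) copy of the literal

latin1 = decode_as(raw, "ISO-8859-1")
cp1252 = decode_as(raw, "Windows-1252")

puts format("ISO-8859-1:   U+%04X", latin1.ord)  # U+0095, a C1 control character
puts format("Windows-1252: U+%04X", cp1252.ord)  # U+2022, the bullet
```

This matches the symptom in the question: the \u0095 in the output is what a correct ISO-8859-1 decoding of byte 0x95 produces, so the parser is doing what the declared charset tells it to.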

In Nokogiri, you can also set the encoding explicitly in cases where the document specifies the wrong encoding:

require 'nokogiri'

# Arguments: markup, base URL (nil here), encoding to assume for the input
Nokogiri::HTML("<p>\x95\x95 JOHNNY VENETTI \x95\x95</p>", nil, "Windows-1252")
# => #<Nokogiri::HTML::Document: ... 
#       children=[#<Nokogiri::XML::Text:0x15744cc "•• JOHNNY VENETTI ••">]>]>]>]>

If you don’t have the option to solve this cleanly as above, you can also do it the hard way and associate the string with its correct encoding yourself:

s = "\x95\x95 JOHNNY VENETTI \x95\x95"
s.encoding # => #<Encoding:ASCII-8BIT>
s.force_encoding 'Windows-1252'
s.encode! 'utf-8'
s # => "•• JOHNNY VENETTI ••"

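Note that this last piece of code requires Ruby 1.9 or later, since earlier versions do not track string encodings. On Ruby 2.0 and later, where source files default to UTF-8, s.encoding reports UTF-8 for this literal rather than ASCII-8BIT, but force_encoding works the same way. If you want, you can read more about the encoding system introduced in Ruby 1.9.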
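
If mutating the string in place is undesirable, the same conversion can be wrapped in a small function. This is only a sketch: to_utf8 is a hypothetical name, and Windows-1252 as the default source charset is an assumption based on the bytes shown in the question.

```ruby
# to_utf8 is a hypothetical helper, not part of Ruby or Nokogiri.
# It reinterprets the raw bytes as `from` and returns a new UTF-8 string,
# leaving the argument untouched.
def to_utf8(bytes, from = "Windows-1252")
  bytes
    .dup                    # work on a copy so the caller's string is not modified
    .force_encoding(from)   # relabel the bytes; no bytes change at this step
    .encode("UTF-8",
            invalid: :replace,  # substitute a replacement character
            undef: :replace)    # instead of raising on unmappable bytes
end

raw = "\x95\x95 JOHNNY VENETTI \x95\x95".b
puts to_utf8(raw)
```

Replacing unmappable bytes rather than raising is a design choice suited to scraped pages, where a stray byte should not abort the whole run. Drop the two options if you would rather have Ruby raise an error on bad input.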
