I’m retrieving an HTML document that is parsed with Nokogiri. The HTML is using charset ISO-8859-1. The problem is there are some Unicode chars in the document which are converted to Unicode code points instead of their respective character.
For example, this is some text in the HTML as received (in ISO-8859-1):
\x95\x95 JOHNNY VENETTI \x95\x95
And when attempting to work with this text, it gets converted to this:
\u0095\u0095 JOHNNY VENETTI \u0095\u0095
So my question is, how can I ensure those characters are represented as their appropriate character instead of the code point? I’ve tried doing a gsub on the text, but that seems wrong for this. Also, I do not have control over the encoding of the HTML document.
First you should realize that this string is NOT ISO-8859-1 encoded (
filesays"Non-ISO extended-ASCII text"and the codepage verifies this). May well be this is your problem, in that case you should specify the right encoding (probably something like Windows-1252, in this case) in your HTML document.In Nokogiri, you can also set the encoding explicitly in cases where the document specifies the wrong encoding:
If you don’t have the option to solve this cleanly like above, you can also do it the hard way and associated the string with its correct encoding:
Note that this last piece of code is Ruby 1.9 only. If you want, you can read more about the new encoding system in Ruby 1.9.