I use Nokogiri to parse an html. I need both the content and image tags in the page, so I use inner_html instead of content method. But the value returned by content is encoded correct, while wrongly encoded by inner_html. One note, the page is in Chinese and not use UTF-8 encoding.
Here is my code:
# encoding: utf-8
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'iconv'
doc = Nokogiri::HTML.parse(open("http://www.sfzt.org/advise/view.asp?id=536"), nil, 'gb18030')
doc.css('td.font_info').each do |link|
# output, correct but not i expect: 目前市面上影响比
puts link.content
# output, wrong and not i expect: <img ....></img>Ŀǰ??????Ӱ??Ƚϴ?Ľ????
# I expect: <img ....></img>目前市面上影响比
puts link.inner_html
end
That is written on the ‘Encoding’ section on README: http://nokogiri.org/
So, you should convert
inner_htmlstring manually if you want to get it as UTF-8 string: