I use Nokogiri to parse an html. I need both the content and image

Question

Asked: May 28, 2026 (2026-05-28T01:55:20+00:00)

I use Nokogiri to parse an HTML page. I need both the text content and the image tags in the page, so I use inner_html instead of the content method. However, the value returned by content is encoded correctly, while the value returned by inner_html is not. Note that the page is in Chinese and does not use UTF-8 encoding.

Here is my code:

# encoding: utf-8
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'iconv'

doc = Nokogiri::HTML.parse(open("http://www.sfzt.org/advise/view.asp?id=536"), nil, 'gb18030')

doc.css('td.font_info').each do |link|
  # Output is correctly encoded, but it is not what I need (no image tags): 目前市面上影响比
  puts link.content

  # Output is wrongly encoded: <img ....></img>Ŀǰ??????Ӱ??Ƚϴ?Ľ????
  # I expect: <img ....></img>目前市面上影响比
  puts link.inner_html
end


1 Answer

Editorial Team · Answer 1 · 2026-05-28T01:55:21+00:00

This is documented in the ‘Encoding’ section of the Nokogiri README: http://nokogiri.org/

Strings are always stored as UTF-8 internally. Methods that return
text values will always return UTF-8 encoded strings. Methods that
return XML (like to_xml, to_html and inner_html) will return a string
encoded like the source document.

So if you want the result of inner_html as a UTF-8 string, you have to convert it yourself:

puts link.inner_html.encode('utf-8') # needs String#encode, so Ruby 1.9 or later
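To see the conversion without fetching the page, here is a minimal sketch that uses only Ruby's built-in String methods. The sample markup and the helper name markup_to_utf8 are made up for illustration and are not part of Nokogiri. The GB18030 string stands in for what inner_html would return for a document parsed as 'gb18030'.

```ruby
# Hypothetical helper (not part of Nokogiri): turn serialized markup into UTF-8.
# Assumes the bytes really are in source_encoding.
def markup_to_utf8(markup, source_encoding = 'GB18030')
  # Re-label a copy first, in case the string arrived tagged as binary,
  # then transcode the bytes to UTF-8.
  markup.dup.force_encoding(source_encoding).encode('UTF-8')
end

# Made-up sample markup; the real page content will differ.
expected = "<img src=\"a.gif\">目前市面上影响比"

# Stand-in for the return value of inner_html on a GB18030 document.
gb_markup = expected.encode('GB18030')

utf8_markup = markup_to_utf8(gb_markup)
puts utf8_markup
```

If the string is already tagged with the document's encoding, calling encode('UTF-8') on it directly is enough. The force_encoding step only matters when the string has lost its encoding label.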
