I am working through one of the examples on the Mechanize doc site and I want to parse the results using
Nokogiri.
My problem is that when the following line gets executed:
doc = Nokogiri::HTML(search_results, 'UTF-8' )
the following error occurs:
C:/Ruby192/lib/ruby/gems/1.9.1/gems/nokogiri-1.4.4.1-x86-mingw32/lib/nokogiri/html/document.rb:71:in `parse': undefined method `name' for "UTF-8":String (NoMethodError)
from C:/Ruby192/lib/ruby/gems/1.9.1/gems/nokogiri-1.4.4.1-x86-mingw32/lib/nokogiri/html.rb:13:in `HTML'
from mechanize_test.rb:16:in `<main>'
I have installed Ruby 1.9 on a Windows Vista machine.
The results returned by Mechanize are non-Latin (UTF-8).
The code sample follows.
# encoding: UTF-8
require 'rubygems'
require 'mechanize'
require 'nokogiri'
agent = Mechanize.new
agent.user_agent_alias = 'Mac Safari'
page = agent.get("http://www.google.com/")
search_form = page.form_with(:name => "f")
search_form.field_with(:name => "q").value = "invitations"
search_results = agent.submit(search_form)
puts search_results.body
doc = Nokogiri::HTML(search_results, 'UTF-8')
This appears to be an issue with what Nokogiri expects as parameters to the parse method that is being called. The first issue I see is that you are passing the encoding option in the wrong parameter slot.
A parsing example from the Nokogiri project page that specifies the encoding
Notice that the encoding is the third parameter, not the second. But that still does not fully explain the behavior you are seeing, since a string passed in the second slot would be taken as the URL and simply not be used as an encoding.
Per the Nokogiri documentation, a call to Nokogiri::HTML() is a convenience method that delegates to the parse method.
Code for Nokogiri::HTML::parse
The source for the Nokogiri::HTML::Document parse method is a bit long, but here is the relevant part:
Notice string_or_io.encoding.name: this matches the error you saw, undefined method `name' for "UTF-8":String (NoMethodError).
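The failure can be reproduced in plain Ruby without Nokogiri. This is only a sketch of the mechanism: FakePage is a hypothetical stand-in, and it assumes the object you passed in has an encoding method that returns a plain String.

```ruby
# Hypothetical stand-in for an object whose #encoding returns a plain String.
# This is an assumption about how search_results looks to Nokogiri.
class FakePage
  def encoding
    'UTF-8'
  end
end

# A real String reports its encoding as an Encoding object, which has #name.
string_name = 'invitations'.encode('UTF-8').encoding.name

# The stand-in returns a String, and String has no #name method,
# so the same chained call raises NoMethodError.
error_class = begin
  FakePage.new.encoding.name
  nil
rescue NoMethodError => e
  e.class
end

puts string_name
puts error_class
```

So a chained encoding.name call works for a String or IO argument, but not for an object that hands back the encoding name directly.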
Does your search_results object have an encoding attribute whose value is the string 'UTF-8'? It appears Nokogiri expects encoding to return an object that itself has a name attribute of 'UTF-8', not a plain string.