I have to read XML files that are accessible over HTTP with authentication. That’s why I use Mechanize.
My problem is that I can’t get Mechanize to recognize these XML files so that I can use .find or .search on them.
Here is what I tried first, in my view (HTML file):
<% agent = Mechanize.new %>
<% page = agent.get("http://dl.dropbox.com/u/344349/xml.xml") %>
<%= page %>
This returns #<Mechanize::File:0x007f9dd602de30>. It’s a Mechanize::File and not a Mechanize::Page.
I can’t use .find or .search on it, as it fails with undefined method find for #<Mechanize::File:0x007f9dd624cbd0>
The Mechanize documentation says: “This is the default (and base) class for the Pluggable Parsers. If Mechanize cannot find an appropriate class to use for the content type, this class will be used. For example, if you download a JPG, Mechanize will not know how to parse it, so this class will be instantiated.”
So I created a class as described here: http://rdoc.info/github/tenderlove/mechanize/master/Mechanize/PluggableParser
My class:
class XMLParser < Mechanize::File
  attr_reader :xml
  def initialize(uri=nil, response=nil, body=nil, code=nil)
    super(uri, response, body, code)
    @xml = xml.parse(body)
  end
end
And the updated code in my view (HTML file):
<% agent = Mechanize.new %>
<% agent.pluggable_parser['text/xml'] = XMLParser %>
<% agent.user_agent_alias = 'Windows Mozilla' %>
<% page = agent.get("http://dl.dropbox.com/u/344349/xml.xml") %>
<%= page %>
or even
<% agent = Mechanize.new %>
<% agent.pluggable_parser.xml = XMLParser %>
<% page1 = agent.get('http://dl.dropbox.com/u/344349/xml.xml') # => CSVParser %>
<%= page1 %>
This still returns #<Mechanize::File:0x007f9dd5253b48>
I even tested the exact CSVParser code from the documentation (http://rdoc.info/github/tenderlove/mechanize/master/Mechanize/PluggableParser) and tried loading a CSV file, which is still seen as a Mechanize::File.
What am I doing wrong?
Okay, so I’ve resolved this problem for myself just now. The solution is in two parts:
First, the content type you are matching is incorrect. After you do your get, you can see the content type of the document you fetched by reading the content-type header from page.response.
When I use Mechanize to get your page (‘http://dl.dropbox.com/u/344349/xml.xml’), I see ‘application/xml’ as the content type.
Second, you’re not using PluggableParser correctly. Using XMLParser as you have it here will raise NoMethodError: undefined method 'parse' for nil:NilClass, because xml in xml.parse(body) is the attr_reader, which returns nil until @xml has been assigned.
Change the class definition to use Nokogiri::XML instead. Then, set this as the parser for the correct content type.
To use this, you get your page the same way as before, and then reference the xml attribute of the page object, which is a Nokogiri::XML::Document instance (a subclass of Nokogiri::XML::Node). Fortunately, Mechanize::Page#search is just a wrapper around Nokogiri::XML::Node#search, so you can search on page.xml in much the same way you would on a Mechanize::Page.
A further refinement would be to have XmlParser#search forward to the search method of the Nokogiri document.
This lets you perform your searches directly on the page instance, calling page.search(...) instead of page.xml.search(...).