I want to traverse some HTML documents with Nokogiri.
After parsing a document, I want the final URL that was used to fetch it to be part of my JSON response.
require 'nokogiri'
require 'open-uri'
url = "http://ow.ly/hh8ri"
doc = Nokogiri::HTML(open(url))
...
The request is redirected internally to http://www.mp.rs.gov.br/imprensa/noticias/id30979.html, and I want access to that final URL.
I want to know whether the “doc” object exposes that URL as an attribute or something similar.
Does anyone know a workaround?
By the way, I want the full URL because I’m traversing the HTML to find <img> tags, and some have relative paths like “/media/image/image.png”, which I then make absolute using:
URI.join(url, relative_link_url).to_s
The image URL should be:
http://www.mp.rs.gov.br/media/imprensa/2013/01/30979_260_260__trytr.jpg
Instead of:
http://ow.ly/hh8ri/media/imprensa/2013/01/30979_260_260__trytr.jpg
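To illustrate how much the base URL matters here, below is a minimal sketch that joins the relative image path against the final URL rather than the short one. The helper name `absolute_image_url` is hypothetical and not part of any library. The comment at the end notes that open-uri exposes the post-redirect URL as `base_uri` on the object returned by `open`; that part is left as a comment because it needs network access.

```ruby
require "uri"

# Hypothetical helper: resolve a possibly relative link against a base URL.
def absolute_image_url(base_url, link)
  URI.join(base_url, link).to_s
end

# The final (post-redirect) URL from the question, hard-coded for illustration.
final_url = "http://www.mp.rs.gov.br/imprensa/noticias/id30979.html"
relative  = "/media/imprensa/2013/01/30979_260_260__trytr.jpg"

puts absolute_image_url(final_url, relative)

# With open-uri, the final URL should be obtainable like this:
#   page      = URI.open(url)        # plain open(url) on older Rubies
#   final_url = page.base_uri.to_s
#   doc       = Nokogiri::HTML(page)
```

Links that are already absolute pass through `URI.join` unchanged, so the same helper can be applied to every <img> source.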
EDIT: IDEA
require 'nokogiri'
require 'open-uri'
require 'net/http'

class Scraper < Nokogiri::HTML::Document
  attr_accessor :url

  class << self
    def new(url)
      html = open(url, ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE)
      self.parse(html).tap do |d|
        # Issue a HEAD request to the original URL and read its Location header.
        uri = URI.parse(url)
        http = Net::HTTP.new(uri.host, uri.port)
        head = http.start do |conn|
          conn.head uri.path
        end
        d.url = head['location']
      end
    end
  end
end
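One caveat with this idea: a single HEAD request only exposes the first `Location` header, which may be relative or may itself point at another redirect. Here is a rough sketch of following the chain with Net::HTTP. The method names `final_url` and `next_location` are hypothetical, and the limit of 5 hops is an arbitrary assumption.

```ruby
require "net/http"
require "uri"

# Resolve a Location header (possibly relative) against the URL that returned it.
def next_location(current_url, location)
  URI.join(current_url, location).to_s
end

# Follow redirects using HEAD requests and return the last URL reached.
# Stops after `limit` hops (5 is an arbitrary assumption) and returns
# whatever URL it had resolved by then.
def final_url(url, limit = 5)
  limit.times do
    uri = URI.parse(url)
    response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
      http.head(uri.request_uri)
    end
    # Anything that is not a redirect with a Location header ends the chain.
    return url unless response.is_a?(Net::HTTPRedirection) && response["location"]
    url = next_location(url, response["location"])
  end
  url
end

# Usage (needs network access):
#   final_url("http://ow.ly/hh8ri")
```

Some servers answer HEAD differently from GET, so the result may not always match what open-uri ends up fetching.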
Use Mechanize. The URLs will always be converted to absolute: