I want to traverse some HTML documents with Nokogiri. After getting the XML object,

Asked: June 18, 2026

I want to traverse some HTML documents with Nokogiri.
After getting the XML object, I want to include the last URL that was fetched (the one the document actually came from) in my JSON response.

require 'nokogiri'
require 'open-uri'

url = "http://ow.ly/hh8ri"
doc = Nokogiri::HTML(URI.open(url))
...

The request is redirected internally to http://www.mp.rs.gov.br/imprensa/noticias/id30979.html, and I want access to that final URL.

Does the “doc” object expose that URL as an attribute or something similar?
Does anyone know a workaround?

By the way, I want the full URL because I’m traversing the HTML to find <img> tags, and some have relative URLs like “/media/image/image.png”, which I then make absolute using:

URI.join(url, relative_link_url).to_s

The image URL should be:

http://www.mp.rs.gov.br/media/imprensa/2013/01/30979_260_260__trytr.jpg

Instead of:

http://ow.ly/hh8ri/media/imprensa/2013/01/30979_260_260__trytr.jpg
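
To illustrate why the base URL matters, here is a minimal sketch of the resolution step using only Ruby's standard `uri` library. The helper name `absolutize` is hypothetical (it is not part of any library), and the page URL is the redirect target quoted in the question.

```ruby
require 'uri'

# Hypothetical helper: resolve a link found in the page against the URL
# the page was actually served from.
def absolutize(base_url, link)
  URI.join(base_url, link).to_s
end

# Final URL after the redirect, as quoted in the question.
page_url = "http://www.mp.rs.gov.br/imprensa/noticias/id30979.html"

# A root-relative path keeps only the scheme and host of the base URL.
puts absolutize(page_url, "/media/imprensa/2013/01/30979_260_260__trytr.jpg")

# A link that is already absolute is returned unchanged.
puts absolutize(page_url, "http://www.mp.rs.gov.br/images/imprensa/barra_area.gif")
```

Because the host comes from the base URL, joining against the short URL instead of the final one would point the image at the wrong server.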

EDIT: IDEA

require 'nokogiri'
require 'open-uri'
require 'net/http'
require 'openssl'

class Scraper < Nokogiri::HTML::Document
  attr_accessor :url

  class << self
    def new(url)
      html = URI.open(url, ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE)
      self.parse(html).tap do |d|
        # Send a HEAD request and read the Location header of the first redirect.
        uri = URI.parse(url)
        head = Net::HTTP.start(uri.host, uri.port) do |http|
          http.head uri.path
        end
        d.url = head['location']
      end
    end
  end
end
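
The idea above reads only the first Location header, so it would miss a chain of redirects. A rough sketch of following the chain with HEAD requests, using only the standard library, might look like the following. The method name `final_url`, the `limit` default, and the `head_request` parameter are all assumptions made for illustration; `head_request` exists only so the logic can be exercised without a network connection.

```ruby
require 'net/http'
require 'uri'

# Follow Location headers until a non-redirect response is reached and
# return the last URL requested. Hypothetical helper, not a library API.
def final_url(url, limit: 5, head_request: nil)
  # Default requester: perform a real HEAD request for the given URI.
  head_request ||= lambda do |uri|
    Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
      http.head(uri.request_uri)
    end
  end

  current = url
  limit.times do
    response = head_request.call(URI.parse(current))
    return current unless response.is_a?(Net::HTTPRedirection)
    # Location may be relative, so resolve it against the current URL.
    current = URI.join(current, response['location']).to_s
  end
  raise "too many redirects (limit: #{limit})"
end
```

Calling `final_url("http://ow.ly/hh8ri")` would then give a base URL suitable for `URI.join`, assuming the server answers HEAD requests the same way it answers GET.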

1 Answer

Editorial Team · Answer 1 · 2026-06-18T02:15:37+00:00

Use Mechanize. It follows the redirect for you, and the image URLs it returns are already absolute:

require 'mechanize'
agent = Mechanize.new
page = agent.get 'http://ow.ly/hh8ri'
page.images.map{|i| i.url.to_s}
#=> ["http://www.mp.rs.gov.br/images/imprensa/barra_area.gif", "http://www.mp.rs.gov.br/media/imprensa/2013/01/30979_260_260__trytr.jpg"]
