I’m trying to extract the href of every a element on an HTML page for evaluation, using Nokogiri and XPath. What I have so far seems to pull out only the link titles. I’m not interested in the link title, just the URL that it points to.
Here’s what I have:
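require 'nokogiri'
require 'open-uri'
# URI.open is needed on Ruby 3.0+, where plain open no longer fetches URLs
doc = Nokogiri::HTML(URI.open("http://www.cnn.com"))
doc.xpath('//a').each do |node|
  puts node.text
end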
Can anyone guide me on how to correct this so that I’m pulling the actual href instead of the text itself?
Your XPath of //a returns every a element in the document, text content included, and node.text then prints that text. In XPath you can use @attrname to select an attribute. For example:
doc.xpath('//a/@href')
will get you the href of every a in the document.