I’m parsing a Reddit RSS feed with Nokogiri for a certain subreddit.
I’m trying to capture the external URL of the post if it goes to a certain domain.
Unfortunately, even if the post created by the user links to an external website, every item in the RSS feed links to the reddit post (comment area) regardless. There is one element called description, however, generated by the Reddit RSS feed, which DOES contain an HTML string that includes two links:
[link][2 comments]
The [link] anchor is always the second-to-last anchor in description.text.
With Nokogiri, I can get down to the part where I pull the entire description into a string, and then I instantiate a new Nokogiri::HTML object with this string.
I’m wondering two things:
- Is there a way to convert a string to Nokogiri::HTML so I don't need to create a new object?
- How do I save the href value for the second-to-last link which appears in the description?
Code:
def scrape
  @document = Nokogiri::XML(open(self.url))
  @document.xpath("//item").each do |item|
    description_html = item.xpath('description').text
    url = Nokogiri::HTML(description_html)... #?
  end
end
Figured it out