I am currently using the following code in my Product model to read and save the og:images of retail sites.
def photo_from_url(url)
  # Fetch and parse the page once instead of once per lookup.
  og_image = Nokogiri::HTML(open(url)).css("meta[property='og:image']").first
  return if og_image.nil?

  # Read the content attribute as a plain string.
  photo_url = og_image["content"]
  self.photo = URI.parse(photo_url)
  save
end
While this works on most pages, some og:image URLs fail with the error bad URI(is not URI?).
One example is the following link format from H&M’s retail site:
http://lp.hm.com/hmprod?set=source[/model/2012/K71 05701 95313 06 0043 0.jpg],rotate[],width[],height[],x[],y[],type[STILL_LIFE_FRONT]&call=url[file:/product/facebook]
Obviously, this isn’t a pretty link (even StackOverflow’s Markdown parser can’t tell that it’s a link…), but it does actually work when pasted directly into a browser.
What can I do to correctly read a link like this?
Whoa, that looks like a nasty URL. Nice URL scheme notwithstanding, I suggest you simply escape the URL with URI::Escape before passing it to URI.parse.
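As a rough sketch of what that escaping step could look like: the helper name escape_url below is hypothetical and not part of your model. The URI.escape / URI.encode shortcuts from URI::Escape were removed in Ruby 3.0, so this sketch calls the RFC 2396 parser's escape method directly, which should behave the same way on older and newer Rubies. The characters most likely tripping up URI.parse here are the spaces in the file name, and those get turned into %20.

```ruby
require "uri"

# Hypothetical helper (not from the original model): percent-encodes
# characters that are not allowed in a URI, such as spaces.
def escape_url(raw_url)
  # URI.escape / URI.encode are gone in Ruby 3.0+, but the RFC 2396
  # parser still exposes the same escaping logic.
  URI::RFC2396_Parser.new.escape(raw_url)
end

raw = "http://lp.hm.com/hmprod?set=source[/model/2012/K71 05701 95313 06 0043 0.jpg]," \
      "rotate[],width[],height[],x[],y[],type[STILL_LIFE_FRONT]&call=url[file:/product/facebook]"

escaped = escape_url(raw)  # spaces are now percent-encoded
uri = URI.parse(escaped)   # parse the escaped string, not the raw one

puts escaped
puts uri.host
```

In photo_from_url, that would presumably mean assigning self.photo = URI.parse(escape_url(photo_url)) instead of parsing photo_url directly.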