I’ve got a module that does web scraping. I call this method several times, since it captures all the data on the web page:
def page_as_xml(uri)
  @page_as_xml ||= Nokogiri::HTML(open(uri))
end
Since I’ll call the above method a handful of times for each page, it makes sense to keep the result in an instance variable. However, how do I “empty out” the instance variable once I’m done with a page?
All the web scraping ends up in a hash (see below). If I don’t “empty out” the instance variable, the page_as_xml data from the first page will be reused for every other page.
:page1 =>
  {
    :url => @page1,
    :title => download_title(@page1),
    :meta_tags => download_robots_tags(@page1)
  },
:page2 =>
  {
    :url => @page2,
    :title => download_title(@page2),
    :meta_tags => download_robots_tags(@page2)
  },
:page3 =>
  {
    :url => @page3,
    :title => download_title(@page3),
    :meta_tags => download_robots_tags(@page3)
  },
How about making it a hash keyed by the URI, so that each page gets its own cached document?
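A minimal sketch of that idea, not necessarily the exact code originally posted. The module name `Scraper`, the `clear_pages` method and the `load_page` helper are hypothetical names introduced for illustration. It also assumes a Ruby version where open-uri is called through `URI.open` rather than the bare `open(uri)` used in the question.

```ruby
module Scraper
  # Parses the page at +uri+ at most once per URI and reuses the result
  # on later calls with the same URI.
  def page_as_xml(uri)
    @pages ||= {}
    @pages[uri] ||= load_page(uri)
  end

  # Optional: drops the cached document for one URI, or every cached
  # document when called without an argument.
  def clear_pages(uri = nil)
    @pages ||= {}
    uri ? @pages.delete(uri) : @pages.clear
  end

  private

  # Hypothetical helper wrapping the fetch-and-parse step from the question.
  # The libraries are required lazily so the module loads without them.
  def load_page(uri)
    require 'nokogiri'
    require 'open-uri'
    Nokogiri::HTML(URI.open(uri))
  end
end
```

With this, `download_title(@page1)` and `download_robots_tags(@page1)` share one parsed document, while `@page2` gets its own entry.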
Now you don’t have to worry about emptying it (unless memory is an issue).
I don’t really understand why you need to call it more than once, though. Also, why do you call it page_as_xml if it is HTML?