I use the mechanize gem to crawl websites. I wrote a very simple, single-threaded crawler inside a Rails rake task because I needed access to my Rails models.
The crawler runs just fine, but after watching it run for a while I can see that it uses more and more RAM over time.
I use the God gem to monitor my crawler.
Below is my rake task code. Is there anything in it that could leak memory?

    task :abc => :environment do
      prefix_url = 'http://example.com/abc-'
      postfix_url = '.html'
      from_page_id = (AppConfig.last_crawled_id || 1) + 1
      to_page_id = 100000

      agent = Mechanize.new
      agent.user_agent_alias = 'Mac Safari'

      (from_page_id..to_page_id).each do |i|
        url = "#{prefix_url}#{i}#{postfix_url}"
        puts "#{Time.now} - Crawl #{url}"
        page = agent.get(url)

        page.search('#content > ul').each do |s|
          var = s.css('li')[0].text()
          value = s.css('li')[1].text()
          MyModel.create :var => var, :value => value
        end

        AppConfig.last_crawled_id = i
      end

      # Finish crawling, let's stop
      `god stop crawl_abc`
    end

Unless you've got the very latest version of mechanize (2.1.1 was released only a day or so ago), mechanize by default operates with an unlimited history size, i.e. it keeps every page you have visited and so gradually uses more and more memory.
In your case there is no point in keeping that history, so calling max_history= on your agent should limit how much memory is used in this way.