The Archive Base

Editorial Team
Asked: June 2, 2026 (2026-06-02T23:02:23+00:00)

I began to learn about web crawlers recently and I built a sample crawler

I began to learn about web crawlers recently and built a sample crawler with Ruby and Anemone, using MongoDB for storage. I’m testing the crawler on a massive public website with possibly billions of links.

crawler.rb is indexing the correct information, but when I check memory use in Activity Monitor, it shows the memory constantly growing. I have only run the crawler for about 6-7 hours, and memory is already at 1.38GB for mongod and 1.37GB for the Ruby process. It seems to grow by about 100MB every hour or so.

It seems that I might have a memory leak. Is there a better way to run the same crawl without memory escalating out of control, so that it can run longer?

# Sample web_crawler.rb with Anemone, Mongodb and Ruby.

require 'anemone'

# do not store the page's body.
module Anemone
  class Page
    def to_hash
      {'url' => @url.to_s,
       'links' => links.map(&:to_s),
       'code' => @code,
       'visited' => @visited,
       'depth' => @depth,
       'referer' => @referer.to_s,
       'fetched' => @fetched}
    end
    def self.from_hash(hash)
      page = self.new(URI(hash['url']))
      {'@links' => hash['links'].map { |link| URI(link) },
       '@code' => hash['code'].to_i,
       '@visited' => hash['visited'],
       '@depth' => hash['depth'].to_i,
       '@referer' => hash['referer'],
       '@fetched' => hash['fetched']
      }.each do |var, value|
        page.instance_variable_set(var, value)
      end
      page
    end
  end
end


Anemone.crawl("http://www.example.com/", :discard_page_bodies => true, :threads => 1, :obey_robots_txt => true, :user_agent => "Example - Web Crawler", :large_scale_crawl => true) do | anemone |
  anemone.storage = Anemone::Storage.MongoDB

  #only crawl pages that contain /example in url
  anemone.focus_crawl do |page|
    links = page.links.delete_if do |link|
      (link.to_s =~ /example/).nil?
    end
  end

  # only process pages in the /example directory
  anemone.on_pages_like(/example/) do | page |
    regex = /some type of regex/
    example = page.doc.css('#example_div').inner_html.gsub(regex,'') rescue next

    # Append non-empty matches to a text file
    unless example.nil? || example.empty?
      File.open('example.txt', 'a') { |f| f.puts example }
    end
    page.discard_doc!
  end
end
1 Answer

  1. Editorial Team
    Added an answer on June 2, 2026 at 11:02 pm

    I am also having this problem, but I am using Redis as the datastore.

    This is my crawler:

    require "rubygems"
    require "anemone"

    opts = { discard_page_bodies: true, skip_query_strings: true, depth_limit: 2000, read_timeout: 10 }

    File.open("results.csv", "a") do |result_file|
      # Read urls.csv one row at a time; File.foreach closes the file when done.
      # Column 0 is copied to results.csv, column 1 holds the host or URL.
      File.foreach("urls.csv") do |row|
        row_ = row.strip.split(',')
        # Prepend a scheme when the row only contains a bare host name.
        url = row_[1].start_with?("http://") ? row_[1] : "http://#{row_[1]}"

        # Pass the options hash directly (the original wrote `options = opts`,
        # which only created an unused local variable).
        Anemone.crawl(url, opts) do |anemone|
          anemone.storage = Anemone::Storage.Redis
          puts "crawling #{url}"

          anemone.on_every_page do |page|
            next if page.body.nil?

            if page.body.downcase.include?("sometext")
              puts "found one at #{url}"
              result_file.puts "#{row_[0]},#{row_[1]}"
            end
          end # end on_every_page
        end # end crawl
      end # end File.foreach

      # we're done
      puts "We're done."
    end # end File.open
    

    I applied the patch from here to my core.rb file in the anemone gem:

    35       # Prevent page_queue from using excessive RAM. Can indirectly limit rate of crawling. You'll additionally want to use discard_page_bodies and/or a non-memory 'storage' option
    36       :max_page_queue_size => 100,
    

    …

    (The following used to be on line 155)

    157       page_queue = SizedQueue.new(@opts[:max_page_queue_size])
    

    and I have an hourly cron job doing:

    #!/usr/bin/env python
    import redis
    r = redis.Redis()
    r.flushall()
    

    to try and keep Redis’s memory usage down. I’m restarting a giant crawl now, so we’ll see how it goes!
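
The effect of swapping the page queue for a SizedQueue can be seen in isolation. The following is a minimal sketch, not Anemone's own code: `run_bounded_pipeline`, the `"page-N"` strings, and the `:done` marker are hypothetical stand-ins. It only illustrates that a producer pushing into a full SizedQueue blocks until a consumer pops, so the queue never holds more than its limit, which is also why the patch comment says it can indirectly limit the crawl rate.

```ruby
# Sketch of the backpressure a SizedQueue provides (stdlib only).
# Returns [pages_processed, largest_queue_size_observed].
def run_bounded_pipeline(max_queue_size, total_pages)
  page_queue = SizedQueue.new(max_queue_size)

  # Producer: stands in for whatever fills the queue with fetched pages.
  # push blocks whenever the queue already holds max_queue_size items.
  producer = Thread.new do
    total_pages.times { |i| page_queue.push("page-#{i}") }
    page_queue.push(:done) # hypothetical end-of-crawl marker
  end

  processed = 0
  peak_size = 0

  # Consumer: stands in for the loop that handles each page.
  loop do
    peak_size = [peak_size, page_queue.size].max
    page = page_queue.pop
    break if page == :done
    processed += 1
  end

  producer.join
  [processed, peak_size]
end

processed, peak_size = run_bounded_pipeline(100, 1_000)
puts "processed=#{processed} peak_queue_size=#{peak_size}"
```

With an unbounded Queue the producer could run arbitrarily far ahead of the consumer, and every waiting page would stay in memory until it was handled.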

    I’ll report back with results…
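
To compare runs, it can help to know whether the growth is in Ruby objects at all. The following is a rough sketch using only `GC.stat` and `ObjectSpace.count_objects` from core Ruby; `heap_snapshot`, `heap_growth`, and the simulated `retained` array are hypothetical names for illustration. It only sees objects on the Ruby heap, so memory held by native extensions (for example, the parsed documents Nokogiri builds for `page.doc`) would not show up here.

```ruby
# Take a snapshot of the Ruby heap after a full GC, so the counts
# reflect objects that are still reachable rather than garbage.
def heap_snapshot
  GC.start
  counts = ObjectSpace.count_objects
  {
    live_slots: GC.stat[:heap_live_slots],
    strings: counts[:T_STRING],
    arrays: counts[:T_ARRAY]
  }
end

# Difference between two snapshots, key by key.
def heap_growth(before, after)
  after.each_with_object({}) { |(key, value), diff| diff[key] = value - before[key] }
end

before = heap_snapshot

# Simulated leak: 50,000 URL strings that stay referenced.
retained = Array.new(50_000) { |i| "http://www.example.com/page/#{i}" }

after = heap_snapshot
growth = heap_growth(before, after)
puts "retained #{retained.size} strings, heap growth: #{growth.inspect}"
```

In a real crawl, a snapshot could be logged every few thousand pages. If the string or array counts stay flat while the process keeps growing, the retained memory is probably outside the Ruby heap.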

