The Archive Base

Editorial Team
Asked: June 12, 2026 (2026-06-12T03:04:35+00:00)

I’m building a web crawler in Ruby, Rails as the front-end. I’m using Mechanize


I’m building a web crawler in Ruby, with Rails as the front end. I’m using Mechanize, which is built on top of Nokogiri. I already have a solution that crawls the web pages, but I want to be able to crawl 200k websites in a single run, and I know there must be a better way than waiting hours for it to finish. I want to get the best performance by firing off parallel requests without making things too complex. I don’t know anything about threading or what its limits are, and I don’t want to hold the server hostage while the crawler is running. Could someone point me to where I can learn how to do this, or at least tell me what I should be looking for? Keep in mind that I will be writing to the database and to a file (I can probably export from the database once the crawl is done rather than writing to the file directly). Thanks.

Note: There’s a similar question here on SO, but it is a few years old and seems very complex; maybe people are doing it differently nowadays.


1 Answer

  1. Editorial Team
    Added an answer on June 12, 2026 at 3:04 am (2026-06-12T03:04:36+00:00)

    Look at using Typhoeus and Hydra. They’ll make it easy to process the URLs
    in parallel.

    You don’t need to use Mechanize unless you have to request special data from
    each page. For a normal crawler you can grab the body and parse it using
    OpenURI and Nokogiri without Mechanize’s overhead or added functionality. For
    your purpose, substitute Typhoeus for OpenURI and let Hydra manage the
    parallel requests.

    Remember, crawling 200k websites is going to saturate your bandwidth if you try
    to do them all at once. That’ll make your Rails site unavailable, so you need
    to throttle your requests. And, that means you will have to do them over
    several (or many) hours. Speed isn’t as important as keeping your site online
    here. I’d probably put the crawler on a separate machine from the Rails server
    and let the database tie things together.
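One simple way to throttle, assuming a fixed minimum gap between requests to the same host is good enough, is sketched below. The class name `RequestThrottle` and its parameters are made up for illustration; the clock and sleeper are injectable only so the behavior can be checked without actually sleeping.

```ruby
# Hypothetical throttle: enforces a minimum number of seconds between
# successive requests that share the same key (for example, a host name).
class RequestThrottle
  def initialize(min_interval:,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                 sleeper: ->(seconds) { sleep(seconds) })
    @min_interval = min_interval
    @clock = clock
    @sleeper = sleeper
    @next_allowed_at = {}
  end

  # Waits until a request for +key+ is allowed, records it, and returns
  # how many seconds were spent waiting.
  def wait_for(key)
    now = @clock.call
    allowed_at = @next_allowed_at.fetch(key, now)
    wait = [allowed_at - now, 0].max
    @sleeper.call(wait) if wait > 0
    @next_allowed_at[key] = now + wait + @min_interval
    wait
  end
end
```

Calling `throttle.wait_for(host)` right before queuing each request keeps any single site from being hit more often than once per `min_interval`. Using one shared key for every request would cap the overall request rate instead.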

    Create a table or file that contains the site URLs you are crawling. I’d
    recommend the table so you can put together a form to edit/manage the URLs.
    You’ll want to track things like:

    • The last time a URL was crawled. (DateTime)
    • Whether you should crawl a particular URL. (Boolean or char(1))
    • URL. (String or varchar(1024) should be fine.) This should be a unique key.
    • Whether that URL is currently being crawled. (Boolean or char(1)) This is cleared for all records at the start of a run, then set and left set when a spider goes to load that page.
    • A field showing which days it’s OK to crawl that site.
    • A field showing which hours it’s OK to crawl that site.

    The last two are important. You don’t want to crawl a little site that is
    underpowered and kill its connection. That’s a great way to get banned.
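To make the fields above concrete, here is a plain-Ruby sketch of one row and the "is it OK to crawl right now?" check. The struct name, the field names, and the choice to store days as weekday numbers and hours as a range are all assumptions; in Rails this would normally be a model backed by the table.

```ruby
# Hypothetical in-memory shape of one row in the site table.
CrawlTarget = Struct.new(
  :url, :enabled, :in_progress, :last_crawled_at, :allowed_wdays, :allowed_hours,
  keyword_init: true
) do
  # True when the row is enabled, not already claimed by a spider,
  # and +now+ falls inside the allowed days and hours.
  def crawlable_at?(now)
    enabled && !in_progress &&
      allowed_wdays.include?(now.wday) &&
      allowed_hours.cover?(now.hour)
  end
end

target = CrawlTarget.new(
  url: 'https://example.com/',
  enabled: true,
  in_progress: false,
  last_crawled_at: nil,
  allowed_wdays: [0, 6],   # Sunday and Saturday (Time#wday numbering)
  allowed_hours: (1..5)    # 01:00 through 05:59
)
```

A spider would skip any target where `crawlable_at?(Time.now)` is false and set `in_progress` before loading the page.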

    Create another table that holds the next URLs to check on a particular site,
    gathered from the links you encounter while crawling. You’ll want to come up
    with a normalization routine that reduces a URL with session data and parameters
    to something you can use to test for uniqueness. URLs in this new table must
    be unique so you don’t get into a loop or keep adding the same page
    with different parameters.
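A normalization routine along those lines might look like the sketch below, which uses only Ruby's standard `uri` library. The list of parameters to ignore is an assumption and would need tuning per site, and the method expects absolute URLs.

```ruby
require 'uri'

# Query parameters treated as session or tracking noise (an assumed list).
IGNORED_PARAMS = %w[phpsessid sid sessionid utm_source utm_medium utm_campaign].freeze

# Reduces an absolute URL to a canonical string suitable for a uniqueness check.
def normalize_url(raw)
  uri = URI.parse(raw.strip)
  raise ArgumentError, "absolute URL required: #{raw}" unless uri.scheme && uri.host

  uri.scheme = uri.scheme.downcase
  uri.host = uri.host.downcase
  uri.fragment = nil                  # "#section" never reaches the server
  uri.path = '/' if uri.path.empty?

  if uri.query
    params = URI.decode_www_form(uri.query)
                .reject { |key, _value| IGNORED_PARAMS.include?(key.downcase) }
                .sort                 # stable order so a=1&b=2 equals b=2&a=1
    uri.query = params.empty? ? nil : URI.encode_www_form(params)
  end

  uri.to_s
end
```

Storing the normalized string in the unique column means the same page reached with a different session ID or parameter order is rejected as a duplicate.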

    You might want to pay attention to the actual landing-URL retrieved after any
    redirects instead of the “get” URL, because redirects and DNS names could vary
    inside a site and the people generating the content could be using different
    host names. Similarly, you might want to look for meta-redirects in the head
    block and follow them. Handling these is a particularly irritating part of
    writing a crawler.
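For the meta-redirect case, a minimal sketch using only the standard library is shown here. The regex is a deliberate simplification: it assumes `http-equiv` appears before `content` and that the target URL itself is unquoted inside the `content` value. Parsing the head with Nokogiri would be more robust.

```ruby
require 'uri'

# Matches tags such as <meta http-equiv="refresh" content="5; url=/next">.
META_REFRESH = /<meta[^>]+http-equiv=["']?refresh["']?[^>]*content=["']\s*\d+\s*;\s*url=([^"'>]+)["']/i

# Returns the absolute redirect target found in +html+, or nil if there is none.
# +base_url+ is the URL the page was actually loaded from.
def meta_redirect_target(html, base_url)
  match = html.match(META_REFRESH)
  return nil unless match

  URI.join(base_url, match[1].strip).to_s
end
```

If this returns a URL, the crawler can queue it and treat it as the page's real location, the same way it would treat an HTTP redirect.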

    As you encounter new URLs, check whether they are exiting URLs, that is, links
    that would take you off that site if you followed them. If so, don’t add them to
    your URL table.
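The exiting-URL check can be sketched as a host comparison after resolving the link against the page it was found on. The method name `same_site?` is made up for illustration, and it treats every distinct host name, including subdomains, as a different site, which may be too strict if a site spreads its content across several host names.

```ruby
require 'uri'

# True when +link+ (absolute or relative) resolves to the same host as +page_url+.
# Unparseable links and links without a host (mailto:, javascript:) count as exiting.
def same_site?(page_url, link)
  target_host = URI.join(page_url, link).host
  page_host = URI.parse(page_url).host
  return false if target_host.nil? || page_host.nil?

  target_host.casecmp?(page_host)
rescue URI::Error
  false
end
```

Only links for which `same_site?` is true would be normalized and added to the URL table.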

    It’s probably not going to help to write the database information to files,
    because to locate the right file you’ll probably need to do a database search
    anyway. Just store what you need in a field and request it directly. 200K rows
    is nothing in a database.

    Pay attention to each site’s “spider” rules (typically its robots.txt), and if
    a site offers an API to get at the data, use that instead of crawling.

