I’m building a web crawler in Ruby, with Rails as the front end. I’m using Mechanize, which is built on top of Nokogiri. I’ve already implemented a solution that crawls the web pages, but I’m looking to crawl 200k websites in a single run, and I know there’s a better way than waiting hours for it to finish. I want to get the best performance by firing off parallel requests without making it too complex. I don’t know anything about threading or what its limits are, and I don’t want to hold the server hostage while the crawler is running, so I’d appreciate it if someone could point me to where I can learn how to do this, or at least tell me what I should be looking for. Keep in mind that I will be writing to the database and to a file (I can probably export from the database once the crawl is done rather than writing to the file directly). Thanks.
Note: There’s a similar question here on SO, but it’s a few years old and seems very complex; maybe people are doing it differently nowadays.
Look at using Typhoeus and Hydra. They’ll make it easy to process the URLs
in parallel.
You don’t need to use Mechanize, unless you have to request special data from
each page. For a normal crawler you can grab the body and parse it using
OpenURI and Nokogiri without Mechanize’s overhead or added functionality. For
your purpose, substitute Typhoeus for OpenURI and let Hydra manage the
concurrency.
Remember, crawling 200k websites is going to saturate your bandwidth if you try
to do them all at once. That’ll make your Rails site unavailable, so you need
to throttle your requests. And, that means you will have to do them over
several (or many) hours. Speed isn’t as important as keeping your site online
here. I’d probably put the crawler on a separate machine from the Rails server
and let the database tie things together.
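To make the throttling idea concrete, here is a minimal plain-Ruby sketch of capped concurrency with a pause between requests. It uses the core Thread and Queue classes rather than Typhoeus/Hydra, and the names crawl_throttled, fetcher, max_concurrency and delay are made up for illustration; fetcher stands in for whatever actually downloads a page.

```ruby
# Illustrative only: a bounded worker pool, not Typhoeus/Hydra.
# `fetcher` is a hypothetical callable that takes a URL and returns a result.
def crawl_throttled(urls, fetcher, max_concurrency: 5, delay: 0.0)
  queue = Queue.new
  urls.each { |url| queue << url }
  queue.close # workers get nil from pop once the queue is drained

  results = Queue.new

  workers = Array.new(max_concurrency) do
    Thread.new do
      while (url = queue.pop)
        begin
          results << [url, fetcher.call(url)]
        rescue StandardError => e
          results << [url, e] # record the failure and keep crawling
        end
        sleep(delay) if delay.positive? # pause before this worker's next request
      end
    end
  end

  workers.each(&:join)

  # All workers are finished, so draining the results queue is safe here.
  pairs = []
  pairs << results.pop until results.empty?
  pairs.to_h
end

# Usage with a fake fetcher (no network involved):
fake_fetcher = ->(url) { "body of #{url}" }
pages = crawl_throttled(%w[http://a.example/ http://b.example/], fake_fetcher, max_concurrency: 2)
```

At most max_concurrency requests are in flight at any moment, which is the property that keeps your bandwidth from being saturated. Hydra has its own max_concurrency setting for the same purpose.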
Create a table or file that contains the site URLs you are crawling. I’d
recommend the table so you can put together a form to edit/manage the URLs.
You’ll want to track things like:
The last two are important. You don’t want to crawl a little site that is
underpowered and kill its connection. That’s a great way to get banned.
Create another table that holds the next URLs to check on a particular site,
gathered from the links you encounter while crawling. You’ll want to come up
with a normalization routine to reduce a URL with session data and parameters
to something you can use to test for uniqueness. In this new table you’ll want
URLs to be unique so you don’t get into a loop or keep adding the same page
with different parameters.
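A normalization routine along those lines might look something like the sketch below. It only uses Ruby’s standard URI library. The list of parameters to drop (IGNORED_PARAMS) and the method name normalize_url are assumptions for illustration; which parameters are really session noise depends on the sites you crawl.

```ruby
require "uri"
require "set"

# Hypothetical list of query parameters treated as session/tracking noise.
IGNORED_PARAMS = %w[sid sessionid phpsessid jsessionid utm_source utm_medium utm_campaign].freeze

# Returns a canonical string for an absolute http(s) URL, or nil if the input
# is not usable. Resolve relative links against the page URL before calling this.
def normalize_url(raw)
  uri = URI.parse(raw.strip)
  return nil unless uri.is_a?(URI::HTTP) && uri.host # skips mailto:, javascript:, etc.

  uri.scheme = uri.scheme.downcase
  uri.host = uri.host.downcase
  uri.fragment = nil                   # "#top" never changes the page fetched
  uri.path = "/" if uri.path.empty?

  if uri.query
    pairs = URI.decode_www_form(uri.query)
               .reject { |key, _| IGNORED_PARAMS.include?(key.downcase) }
               .sort                   # parameter order should not matter
    uri.query = pairs.empty? ? nil : URI.encode_www_form(pairs)
  end

  uri.to_s
rescue URI::InvalidURIError, ArgumentError
  nil
end

# In-memory uniqueness check; Set#add? returns nil when the value is already present.
seen = Set.new
first_time  = seen.add?(normalize_url("http://example.com/page?b=2&a=1"))
second_time = seen.add?(normalize_url("http://Example.com/page?a=1&b=2&PHPSESSID=abc"))
```

Here second_time is nil because both URLs reduce to the same string. In the database, a unique index on the normalized column would play the role of the Set.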
You might want to pay attention to the actual landing-URL retrieved after any
redirects instead of the “get” URL, because redirects and DNS names could vary
inside a site and the people generating the content could be using different
host names. Similarly, you might want to look for meta-redirects in the head
block and follow them. These are a particularly irritating aspect of writing
the kind of crawler you want.
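As a rough illustration of spotting a meta-redirect, the sketch below pulls the target out of a meta refresh tag and resolves it against the landing URL. It uses a regular expression purely to stay dependency-free; parsing the head with Nokogiri would be more robust. The method name meta_refresh_target is made up for this example, and it only handles the simple content="N; url=..." form.

```ruby
require "uri"

# Matches e.g. <meta http-equiv="refresh" content="0; url=/new-home">
META_REFRESH = /<meta[^>]+http-equiv\s*=\s*["']?refresh["']?[^>]*>/i

# Returns the absolute URL a meta refresh points to, or nil if there is none.
# `landing_url` is the URL actually retrieved after any HTTP redirects.
def meta_refresh_target(html, landing_url)
  tag = html[META_REFRESH]
  return nil unless tag

  target = tag[/url\s*=\s*['"]?([^'"\s>]+)/i, 1]
  return nil unless target

  URI.join(landing_url, target).to_s # resolves relative targets
rescue URI::Error
  nil
end

html = '<html><head><meta http-equiv="refresh" content="0; url=/new-home"></head></html>'
next_url = meta_refresh_target(html, "http://www.example.com/old")
```

Resolving against the landing URL rather than the URL you originally requested matters for the reason given above: the host may have changed during the redirects.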
As you encounter new URLs, check whether they are exiting URLs, that is, ones
that would cause you to leave that site if you followed them. If so, don’t add
them to your URL table.
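A rough way to test for an exiting URL is to compare host names, as in the sketch below. The method name exiting_url? is hypothetical, and treating "www.example.com" and "example.com" as the same site is an assumption; other subdomains count as different sites here, which you may want to relax given that host names can vary inside a site.

```ruby
require "uri"

# True if following `link` from a page on `site_url` would leave that site.
# Anything that cannot be parsed or has no host (mailto:, javascript:) is
# treated as exiting so it never reaches the URL table.
def exiting_url?(link, site_url)
  link_host = URI.join(site_url, link).host # also resolves relative links
  site_host = URI.parse(site_url).host
  return true if link_host.nil? || site_host.nil?

  bare = ->(host) { host.downcase.sub(/\Awww\./, "") }
  bare.call(link_host) != bare.call(site_host)
rescue URI::Error
  true
end

links = ["/about", "http://example.com/contact", "http://other.example.org/x", "mailto:me@example.com"]
internal = links.reject { |link| exiting_url?(link, "http://www.example.com/") }
```

Only the first two links survive the filter in this example.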
It’s probably not going to help to write the database information to files,
because to locate the right file you’ll probably need to do a database search
anyway. Just store what you need in a field and request it directly. 200K rows
is nothing in a database.
Pay attention to the “spider” rules for sites, and if a site offers an API to
get at the data, use it instead of crawling.
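Assuming the “spider” rules mean a site’s robots.txt file, a very simplified check could look like the following. It only honors Disallow prefixes in groups addressed to all crawlers ("User-agent: *") and ignores Allow lines, wildcards and crawl delays, so treat it as a starting point rather than a complete parser. Both method names are made up for this sketch.

```ruby
# Returns the Disallow path prefixes from groups that apply to every crawler.
def disallowed_prefixes(robots_txt)
  applies = false        # does the current group target "User-agent: *"?
  in_agent_lines = false # are we inside a run of consecutive User-agent lines?
  prefixes = []

  robots_txt.each_line do |line|
    field, value = line.sub(/#.*/, "").strip.split(":", 2).map(&:strip)
    next unless field && value # blank line, comment, or malformed line

    if field.casecmp?("user-agent")
      applies = false unless in_agent_lines # a new group starts here
      applies ||= (value == "*")
      in_agent_lines = true
    else
      in_agent_lines = false
      prefixes << value if field.casecmp?("disallow") && applies && !value.empty?
    end
  end

  prefixes
end

# True if no applicable Disallow prefix matches the start of `path`.
def allowed_by_robots?(path, robots_txt)
  disallowed_prefixes(robots_txt).none? { |prefix| path.start_with?(prefix) }
end

robots = "User-agent: *\nDisallow: /private/\nDisallow: /tmp\n\nUser-agent: otherbot\nDisallow: /\n"
ok_to_fetch = allowed_by_robots?("/index.html", robots)
```

You would fetch robots.txt once per site, keep the result alongside the site’s row, and run this check before adding a path to the URL table.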