I was building a scraper and spider in PHP when I hit this design question. What are the trade-offs between a system that separates the crawling and scraping tasks (as most professional systems seem to do) and one that scrapes as the spider crawls? The only one I could think of is that by splitting the work and putting a queue in between, you can parallelize it better: several scrapers each just ask the queue for the next page to scrape. Can anyone think of other trade-offs, and explain the main reason these are normally separated into two programs?
Note: the order of crawling is the same in both cases; the only difference is when the page gets pulled.
A crawler (or spider) retrieves pages, and a scraper processes them. If you keep these tasks separate, you can change the implementation of one without changing the other. That is the main reason they are separated: it is simply good software design (separation of concerns).
The example you give is a good one: if you combine retrieval with processing in a single class/module/program/function/whatever, any change in how pages are retrieved (e.g., parallel retrieval, retrieval through a proxy, etc.) means modifying code that also does the processing, and in the worst case rewriting the whole program.
Here’s another one: if you want to process a different kind of data (e.g., RSS feeds instead of HTML pages), you have to write your entire scraper from scratch, and you cannot reuse any of the work you did on page retrieval.
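To make the separation concrete, here is a minimal sketch, assuming a plain in-memory queue between the two halves. It is written in Python for brevity, but the same shape carries over to PHP. Every name in it (`crawl`, `scrape`, `fake_fetch`, `html_title`, `rss_item_count`, the `example.test` URLs) is made up for illustration, and the fetcher is a stand-in that reads from a dictionary rather than the network.

```python
from collections import deque
from typing import Callable, Iterable

# Hypothetical type aliases, for illustration only.
Fetcher = Callable[[str], str]          # URL -> raw document body
Processor = Callable[[str, str], dict]  # (URL, body) -> extracted record


def crawl(urls: Iterable[str], fetch: Fetcher, queue: deque) -> None:
    """Retrieval only: fetch each URL and enqueue the raw body."""
    for url in urls:
        queue.append((url, fetch(url)))


def scrape(queue: deque, process: Processor) -> list:
    """Processing only: drain the queue without knowing how pages were fetched."""
    records = []
    while queue:
        url, body = queue.popleft()
        records.append(process(url, body))
    return records


# Stand-in "web" so the sketch runs without network access (assumed data).
FAKE_WEB = {
    "http://example.test/a": "<title>Page A</title>",
    "http://example.test/feed": "<rss><item>Post 1</item><item>Post 2</item></rss>",
}


def fake_fetch(url: str) -> str:
    """Swap this for a real HTTP, proxy, or parallel fetcher; scrape() is unaffected."""
    return FAKE_WEB[url]


def html_title(url: str, body: str) -> dict:
    """Naive HTML processor: pulls the text between <title> tags."""
    start = body.index("<title>") + len("<title>")
    return {"url": url, "title": body[start:body.index("</title>")]}


def rss_item_count(url: str, body: str) -> dict:
    """Naive RSS processor: counts <item> elements."""
    return {"url": url, "items": body.count("<item>")}


queue = deque()

# Same crawler, HTML processor.
crawl(["http://example.test/a"], fake_fetch, queue)
html_records = scrape(queue, html_title)

# Same crawler, RSS processor: retrieval code is reused unchanged.
crawl(["http://example.test/feed"], fake_fetch, queue)
rss_records = scrape(queue, rss_item_count)
```

Changing how pages are retrieved means passing a different `fetch` to `crawl`, and handling a new kind of data means passing a different `process` to `scrape`; neither change touches the other half. In a real system the `deque` would probably be replaced by a persistent or networked queue so that several scraper processes can pull from it, which is the parallelism benefit mentioned in the question.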