I have a two-part question.
First, I’m writing a web scraper based on the CrawlSpider spider in Scrapy. I’m aiming to scrape a website that has many thousands (possibly into the hundreds of thousands) of records. These records are buried 2-3 layers down from the start page. So basically I have the spider start on a certain page, crawl until it finds a specific type of record, and then parse the HTML. What I’m wondering is what methods exist to prevent my spider from overloading the site? Is there possibly a way to do things incrementally or put a pause in between different requests?
Second, and related, is there a method with Scrapy to test a crawler without placing undue stress on a site? I know you can kill the program while it runs, but is there a way to make the script stop after hitting something like the first page that has the information I want to scrape?
Any advice or resources would be greatly appreciated.
I’m using Scrapy’s caching ability to scrape the site incrementally:
HTTPCACHE_ENABLED = True
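As a rough sketch of how that line might sit in a project's settings.py, here are two companion cache settings that Scrapy also provides. The values are illustrative assumptions, not something the answer above specifies.

```python
# settings.py (sketch; values are assumptions chosen for illustration)

# Serve repeated requests from the local cache instead of hitting the site again.
HTTPCACHE_ENABLED = True

# Directory for cached responses; a relative path is resolved inside the
# project's data directory.
HTTPCACHE_DIR = "httpcache"

# 0 means cached responses never expire, so re-running the spider while
# debugging the parser re-uses pages that were already downloaded.
HTTPCACHE_EXPIRATION_SECS = 0
```

With the cache on, only pages not yet fetched cost the site a real request, which is what makes repeated test runs cheap.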
Or you can use the new 0.14 feature, Jobs: pausing and resuming crawls.
Check these settings:
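A minimal sketch of what enabling the Jobs feature could look like: it is driven by the JOBDIR setting, which tells Scrapy where to persist the crawl state. The directory name below is a hypothetical example.

```python
# settings.py (sketch; the directory name is a hypothetical example)

# Scrapy stores the pending request queue and the seen-request filter here,
# so a crawl that was stopped gracefully can be resumed later.
# Use a separate directory per spider and per run.
JOBDIR = "crawls/records-run-1"
```

Stop the crawl with a single Ctrl-C and let it shut down cleanly, then start the spider again with the same JOBDIR to continue where it left off. The same setting can be passed on the command line with the -s option instead of being put in settings.py.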
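The list of settings is not reproduced here, so the following is only a guess at the kind of settings that address the two questions: throttling requests, and stopping a test run early. All of these are standard Scrapy settings, but the specific values are assumptions to tune for the target site.

```python
# settings.py (sketch; the values are assumptions to adjust per site)

# Wait roughly this many seconds between requests to the same site.
DOWNLOAD_DELAY = 2.0

# Vary the wait between 0.5x and 1.5x DOWNLOAD_DELAY so requests are not
# sent at a perfectly regular interval.
RANDOMIZE_DOWNLOAD_DELAY = True

# Keep only one request in flight per domain at a time.
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# For test runs: close the spider once this many responses have been crawled...
CLOSESPIDER_PAGECOUNT = 20

# ...or as soon as one item has been scraped. Requests already in progress
# still finish, so a few extra pages may be fetched before shutdown.
CLOSESPIDER_ITEMCOUNT = 1
```

The two CLOSESPIDER_* settings are meant for testing only; remove them or set them to 0 for the full crawl.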
You can try out and debug your code in the Scrapy shell.
Also, you can call scrapy.shell.inspect_response at any point in your spider to inspect the response it is currently handling.
The Scrapy documentation is the best resource.