I’m trying out Scrapy for crawling web pages, and I don’t understand why my crawler is crawling only one page. I’ve tried commenting out both rules and allowed_domains, without success. I guess there’s something stupid I’m missing; any help would be appreciated.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.spider import BaseSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class NYSpider(CrawlSpider):
    name = "ny"
    allowed_domains = ["nytimes.com"]
    start_urls = ["http://www.nytimes.com/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=('/2012', )), callback='parse_article'),
        Rule(SgmlLinkExtractor(allow=('/page', ))),
    )

    def parse(self, response):
        print 'page '+response.url

    def parse_article(self, response):
        print 'article '+response.url
Any sample of a program that crawls an entire site would be welcome too.
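You use a callback for the rule. From the docs: follow defaults to True only if callback is None; otherwise it defaults to False.
So you should pass follow=True explicitly to the rule that has callback='parse_article'.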
In addition to this issue, another one, IMO, might be your parse method.
Although you don’t use this method as a callback, you are overriding the method of the same name from the superclass (CrawlSpider), which uses parse itself to implement its logic. So renaming your parse method might work.
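To make the effect of the override concrete, here is a toy model in plain Python. It is not Scrapy's real code: ToyCrawlSpider, BrokenSpider, FixedSpider and parse_page are hypothetical names, and the rule matching is reduced to a substring test. It only illustrates, under those assumptions, how a subclass that redefines parse can silently disable the rule dispatch it inherits.

```python
class ToyCrawlSpider:
    """Toy stand-in for a rule-driven spider; not Scrapy's actual code."""

    rules = ()  # pairs of (url substring, callback name or None)

    def parse(self, url):
        # The base class owns parse: it matches rules and dispatches callbacks.
        handled = []
        for pattern, callback in self.rules:
            if pattern in url and callback is not None:
                handled.append(getattr(self, callback)(url))
        return handled


class BrokenSpider(ToyCrawlSpider):
    rules = (("/2012", "parse_article"),)

    def parse(self, url):
        # Overriding parse replaces the dispatch logic, so the rules never run.
        return ["page " + url]

    def parse_article(self, url):
        return "article " + url


class FixedSpider(ToyCrawlSpider):
    rules = (("/2012", "parse_article"),)

    def parse_page(self, url):
        # Renamed, so the inherited parse still applies the rules.
        return "page " + url

    def parse_article(self, url):
        return "article " + url


url = "http://www.nytimes.com/2012/01/01/story.html"
broken = BrokenSpider().parse(url)  # rule callback is never reached
fixed = FixedSpider().parse(url)    # rule callback handles the URL
```

In the broken variant parse_article is never called, which would match the symptom in the question: only the start page is handled.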
Another issue is that you are not returning an Item or Request in your methods. None of your methods does this. The example demonstrates this pretty well. If you override parse you still need to return proper Items/Requests.