I’m trying to test Scrapy for crawling web pages and I do not understand

Asked: May 31, 2026 (2026-05-31T20:05:47+00:00)

I’m trying to test Scrapy for crawling web pages, and I do not understand why my crawler is crawling only one page. I’ve tried commenting out both rules and allowed_domains, without success. I guess there’s something stupid I’m missing; any help would be appreciated.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.spider import BaseSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class NYSpider(CrawlSpider):
    name = "ny"
    allowed_domains = ["nytimes.com"]
    start_urls = ["http://www.nytimes.com/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=('/2012', )),  callback='parse_article'),
        Rule(SgmlLinkExtractor(allow=('/page', ))),
    )

    def parse(self, response):
        print 'page '+response.url

    def parse_article(self, response):
        print 'article '+response.url

A sample program that crawls an entire site would be welcome too.

1 Answer

Editorial Team · Answer 1 · 2026-05-31T20:05:48+00:00

Your first rule sets a callback, and that changes the default value of follow. From the docs:

follow is a boolean which specifies if links should be followed from
each response extracted with this rule. If callback is None follow
defaults to True, otherwise it defaults to False.

So you should set follow=True explicitly:

 Rule(SgmlLinkExtractor(allow=('/2012', )),  callback='parse_article', follow=True)
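To make the default concrete, here is a small pure-Python model of the rule quoted above. It is only a sketch of the documented behaviour, not Scrapy's own code, and the function name effective_follow is hypothetical.

```python
def effective_follow(callback=None, follow=None):
    """Model of the documented default for Rule's follow argument.

    Hypothetical helper for illustration; it is not part of Scrapy.
    """
    if follow is None:
        # No explicit value: follow only when the rule has no callback.
        return callback is None
    # An explicit follow=True / follow=False always wins.
    return follow


# Rule with a callback and no explicit follow: links are NOT followed.
with_callback = effective_follow(callback="parse_article")

# Same rule with follow=True: links are followed again.
with_callback_and_follow = effective_follow(callback="parse_article", follow=True)

# Rule without a callback: links are followed by default.
without_callback = effective_follow()
```

This is why the '/2012' rule in the question stops the crawl at the article pages unless follow=True is passed.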

In addition to this, another problem might be your parse method:

Warning
When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

Although you don’t use parse as a rule callback, defining it in your spider still overrides the method inherited from the superclass (CrawlSpider). So renaming your parse method might work.
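The effect of the override can be shown with a toy model. The classes below are hypothetical stand-ins, not Scrapy's implementation; they only assume, as the warning says, that the base class keeps its link-following logic in parse.

```python
class ToyCrawlSpider:
    """Toy stand-in for a base class whose parse() does the link following."""

    def parse(self, response):
        # Base-class logic: pretend every link on the page gets scheduled.
        return ["follow " + link for link in response["links"]]


class BrokenSpider(ToyCrawlSpider):
    def parse(self, response):
        # Overrides the base logic, so no link is ever scheduled.
        print("page " + response["url"])
        return []


class RenamedSpider(ToyCrawlSpider):
    def parse_page(self, response):
        # Different name: the inherited parse() stays intact.
        print("page " + response["url"])
        return []


# A plain dict stands in for a response object in this sketch.
response = {"url": "http://www.nytimes.com/", "links": ["/2012/a", "/page/2"]}

broken_result = BrokenSpider().parse(response)    # nothing to follow
renamed_result = RenamedSpider().parse(response)  # base logic still runs
```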

Another issue is that your methods do not return an Item or a Request. Per the docs, a callback:

must return a list containing Item and/or Request objects (or any subclass of them)

None of your methods does this. The example in the docs demonstrates this pretty well. If you override parse, you still need to return proper Items or Requests.