I am trying to crawl sites in a very basic manner, but Scrapy isn't crawling all the links

Asked: May 27, 2026 (2026-05-27T05:59:09+00:00)

I am trying to crawl sites in a very basic manner, but Scrapy isn't crawling all the links. The scenario is as follows:

main_page.html -> contains links to a_page.html, b_page.html, c_page.html
a_page.html -> contains links to a1_page.html, a2_page.html
b_page.html -> contains links to b1_page.html, b2_page.html
c_page.html -> contains links to c1_page.html, c2_page.html
a1_page.html -> contains link to b_page.html
a2_page.html -> contains link to c_page.html
b1_page.html -> contains link to a_page.html
b2_page.html -> contains link to c_page.html
c1_page.html -> contains link to a_page.html
c2_page.html -> contains link to main_page.html

I am using the following rule in my CrawlSpider:

Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True)

But the crawl results are as follows:

DEBUG: Crawled (200) <GET http://localhost/main_page.html> (referer: None)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <GET http://localhost/a_page.html> (referer: http://localhost/main_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <GET http://localhost/a1_page.html> (referer: http://localhost/a_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <GET http://localhost/b_page.html> (referer: http://localhost/a1_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <GET http://localhost/b1_page.html> (referer: http://localhost/b_page.html)
2011-12-05 09:56:07+0530 [test_spider] INFO: Closing spider (finished)

It is not crawling all the pages.
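
For comparison, here is a minimal sketch in plain Python (not Scrapy) of what a complete crawl of the link structure listed earlier should visit. It walks the graph breadth-first and skips pages it has already seen. The names `LINKS` and `crawl_order` are made up for illustration; only the page names come from the scenario.

```python
from collections import deque

# Link graph copied from the scenario: page -> pages it links to.
LINKS = {
    "main_page.html": ["a_page.html", "b_page.html", "c_page.html"],
    "a_page.html": ["a1_page.html", "a2_page.html"],
    "b_page.html": ["b1_page.html", "b2_page.html"],
    "c_page.html": ["c1_page.html", "c2_page.html"],
    "a1_page.html": ["b_page.html"],
    "a2_page.html": ["c_page.html"],
    "b1_page.html": ["a_page.html"],
    "b2_page.html": ["c_page.html"],
    "c1_page.html": ["a_page.html"],
    "c2_page.html": ["main_page.html"],
}


def crawl_order(start, links):
    """Return pages in breadth-first order, visiting each page only once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            # Skip pages already queued or visited (duplicate filtering).
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


if __name__ == "__main__":
    for page in crawl_order("main_page.html", LINKS):
        print(page)
```

Starting from main_page.html, this walk reaches all ten pages, whereas the log above shows only five being crawled.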

NB: I have configured the crawl to run in breadth-first order (BFO), as indicated in the Scrapy documentation.

What am I missing?


1 Answer

Editorial Team · Answer 1 · 2026-05-27T05:59:09+00:00

I had a similar problem today, although I was using a custom spider. It turned out that the website was limiting my crawl because my user agent was scrappy-bot.

Try changing your user agent and running the crawl again. Set it to that of a known browser, for example.

Another thing you might want to try is adding a delay. Some websites block scraping if the time between requests is too short. Try setting DOWNLOAD_DELAY to 2 and see if that helps.

More information about DOWNLOAD_DELAY is at
http://doc.scrapy.org/en/0.14/topics/settings.html
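
As a rough sketch, the two suggestions could look like this in the project's settings.py. USER_AGENT and DOWNLOAD_DELAY are Scrapy settings; the browser string below is only an illustrative placeholder, not a recommended value.

```python
# settings.py (sketch): only the two settings discussed in this answer.

# Present a browser-like user agent instead of the default bot one.
# The exact string is an assumption for illustration; any common
# browser user agent should serve the same purpose.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"

# Wait roughly 2 seconds between consecutive requests to the same site.
DOWNLOAD_DELAY = 2
```

If the missing pages start showing up in the log after this change, the site was most likely throttling or blocking the crawler.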
