I can’t figure out why Scrapy is crawling the first page but not following

Question

Asked: June 13, 2026 (2026-06-13T19:56:07+00:00)

I can’t figure out why Scrapy is crawling the first page but not following the links to crawl the subsequent pages. It must have something to do with the rules. Any help is much appreciated. Thank you!

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistItem

class MySpider(CrawlSpider):
    name = "craig"
    allowed_domains = ["sfbay.craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/acc/"]   

    rules = (
        Rule(SgmlLinkExtractor(allow=(r"index100\.html",),
                               restrict_xpaths=('//p[@id="nextpage"]',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//p")
        items = []
        # Use a distinct loop variable so the list being iterated is not shadowed.
        for title in titles:
            item = CraigslistItem()
            item["title"] = title.select("a/text()").extract()
            item["link"] = title.select("a/@href").extract()
            items.append(item)
        return items

spider = MySpider()

1 Answer

Editorial Team · Answer 1 · 2026-06-13T19:56:08+00:00

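Craigslist names its next pages index100.html, index200.html, index300.html, … up to index900.html. A pattern that allows only index100\.html therefore matches just the first next-page link, so the crawl stops after one hop. Matching any digit in the hundreds place fixes it: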

rules = (
    Rule(SgmlLinkExtractor(allow=(r"index\d00\.html",),
                           restrict_xpaths=('//p[@id="nextpage"]',)),
         callback="parse_items", follow=True),
)

This works for me.
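To see the difference between the two patterns without running a crawl, here is a minimal sketch that uses only Python's `re` module. It assumes the link extractor applies each `allow` pattern as a regular-expression search against the candidate URL. The sample URLs and the `allowed` helper are hypothetical: the URLs are built from the index100 … index900 naming described above, not taken from the live site.

```python
import re

# Illustrative next-page URLs following the index100 ... index900 scheme
# (assumed for this sketch, not scraped from the site).
BASE = "http://sfbay.craigslist.org/acc/"
next_pages = [f"{BASE}index{n}00.html" for n in range(1, 10)]

original = re.compile(r"index100\.html")  # pattern from the question
fixed = re.compile(r"index\d00\.html")    # pattern from the answer


def allowed(pattern, urls):
    """Hypothetical helper: return the URLs that the pattern matches via regex search."""
    return [url for url in urls if pattern.search(url)]


original_hits = allowed(original, next_pages)
fixed_hits = allowed(fixed, next_pages)

print(len(original_hits), "of", len(next_pages), "pages match the original pattern")
print(len(fixed_hits), "of", len(next_pages), "pages match the fixed pattern")
```

With these sample URLs the original pattern matches only index100.html, while the `\d` version matches all nine pages, which is consistent with the spider following a single link and then stopping.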
