I am trying to write a spider to crawl certain pages base on the

Question

0

Asked: June 9, 20262026-06-09T13:48:38+00:00 2026-06-09T13:48:38+00:00

I am trying to write a spider to crawl certain pages base on the

0

I am trying to write a spider to crawl certain pages base on the data or info on the index page. And then store the result in a database.

For example, let say I would like to crawl stackoverflow.com/questions/tagged/scrapy
I would go through the index page, if the question is not in my database, then I would store the answer count in the database, then follow the link of the question and crawl that page.

if the question is already in the database, but the number of answer is greater than the one in the database: crawl that page again.

if the question is already in the database and the answer counter is the same: skip this question.

At the moment I could get all the links and answer counts(in this example)on the index page.
but I don’t know how to make the spider to follow the link to the question page base on the answer count.

Is there a way to do this with one spider instead of having two spiders, one spider is getting all the links on the index page, compares the data with the database, exports a json or csv file, and then passes it to another spider to crawl the question page?

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-09T13:48:40+00:00

Just use BaseSpider. That way you can make all the logic depend on the content you are scraping. I personally prefer BaseSpider since it gives you a lot more control over the scraping process.

The spider should look something like this (this is more of a pseudo code):

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request
from myproject.items import MyItem

class StackOverflow(BaseSpider):
    name = 'stackoverflow.com'
    allowed_domains = ['stackoverflow.com']
    start_urls = ['http://stackoverflow.com/questions']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        for question in hxs.select('//question-xpath'):
            question_url = question.select('./question-url')
            answer_count = question.select('./answer-count-xpath')
            # you'll have to write the xpaths and db logic yourself
            if get_db_answer_count(question_url) != answer_count[0]:
                yield Request(question_url, callback = self.parse_question)

    def parse_question(self, response):
        insert_question_and_answers_into_db
        pass

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I am trying to write a spider to crawl certain pages base on the

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply