I am trying to write a spider that crawls certain pages based on the data on an index page, and then stores the result in a database.
For example, let's say I would like to crawl stackoverflow.com/questions/tagged/scrapy
I would go through the index page. If a question is not in my database, I would store its answer count in the database, then follow the link to the question and crawl that page.
If the question is already in the database, but its answer count is greater than the one stored in the database: crawl that page again.
If the question is already in the database and the answer count is the same: skip this question.
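The three rules above can be sketched as a small decision function. This is only a minimal sketch: the names `Action`, `decide`, and `stored_counts` are made up for illustration, and a plain dict stands in for the real database lookup. It also assumes that an answer count lower than the stored one is treated like an unchanged count, which the rules above do not cover.

```python
from enum import Enum


class Action(Enum):
    CRAWL_NEW = "crawl_new"   # question not seen before
    RECRAWL = "recrawl"       # question seen, but it has more answers now
    SKIP = "skip"             # question seen, answer count unchanged


def decide(stored_counts, question_id, answer_count):
    """Return what to do with a question found on the index page.

    stored_counts is a hypothetical stand-in for the database: a dict
    mapping question id to the answer count recorded on the last crawl.
    """
    previous = stored_counts.get(question_id)
    if previous is None:
        return Action.CRAWL_NEW
    if answer_count > previous:
        return Action.RECRAWL
    # Assumption: equal (or lower) counts are both skipped.
    return Action.SKIP


stored = {"q1": 2, "q2": 5}
print(decide(stored, "q3", 1))  # new question
print(decide(stored, "q1", 4))  # more answers than stored
print(decide(stored, "q2", 5))  # same answer count
```

Inside a spider, the callback that parses the index page could call something like `decide` for each question and only yield a request for the question page when the result is not `Action.SKIP`.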
At the moment I can get all the links and answer counts (in this example) from the index page,
but I don’t know how to make the spider follow the link to a question page based on the answer count.
Is there a way to do this with one spider instead of two? With two spiders, the first would get all the links on the index page, compare the data with the database, and export a JSON or CSV file, which would then be passed to a second spider that crawls the question pages.
Just use BaseSpider (called scrapy.Spider in newer Scrapy releases). That way you can make all the logic depend on the content you are scraping. I personally prefer BaseSpider since it gives you a lot more control over the scraping process.
The spider should look something like this (this is more pseudocode than working code):