I’m scraping a website that’s structured like this:
Archive
    Article 1
        Authors
            Author 1
            Author 2
        Title
        Body
        Comments
            Comment 1
            Comment 2
    ...
Each of the authors in Authors has their own profile page. The problem is that authors write multiple articles, so I end up scraping the same authors’ profiles over and over as my spiders crawl the site.
How would I cache the author profiles with Scrapy?
You should add a duplicates filter: an item pipeline (call it DuplicatesPipeline) that remembers which items it has already seen and drops the repeats.
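A minimal sketch of such a filter, modeled on the usual Scrapy item-pipeline pattern. It assumes each author item carries a unique key; the `profile_url` field name is hypothetical, so substitute whatever uniquely identifies an author in your items. The `try`/`except` around the import is only there so the sketch can run without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so the sketch can be run without Scrapy installed.
    class DropItem(Exception):
        pass


class DuplicatesPipeline:
    """Drop any item whose key has already been seen during this crawl."""

    def __init__(self):
        # Keys of every item that has passed through so far.
        self.seen_keys = set()

    def process_item(self, item, spider):
        # Assumption: each author item has a unique "profile_url" field
        # (hypothetical name; works for dicts and scrapy.Item alike).
        key = item["profile_url"]
        if key in self.seen_keys:
            raise DropItem(f"Duplicate author profile: {key}")
        self.seen_keys.add(key)
        return item
```

Note that a pipeline like this discards duplicate items after they have been scraped; on its own it does not stop the profile page from being requested again.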
Then activate that DuplicatesPipeline in the ITEM_PIPELINES setting.
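As a rough illustration, in current Scrapy versions ITEM_PIPELINES is a dict in the project's settings.py that maps each pipeline's import path to an order value. The module path `myproject.pipelines` below is an assumption about where the class lives, and 300 is an arbitrary order value:

```python
# settings.py
# Assumption: DuplicatesPipeline is defined in myproject/pipelines.py
# (hypothetical project name; use your own import path).
ITEM_PIPELINES = {
    "myproject.pipelines.DuplicatesPipeline": 300,
}
```

Pipelines with lower order values run first, so a duplicates filter is usually given a low number to drop repeats before any more expensive pipeline sees them.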