I’m scraping a website that’s structured like this: Archive Article 1 Authors Author 1

Question


Asked: June 17, 2026 (2026-06-17T20:50:34+00:00)


I’m scraping a website that’s structured like this:

Archive
    Article 1
        Authors
            Author 1
            Author 2
        Title
        Body
        Comments
            Comment 1
            Comment 2
    ...

Each of the authors in Authors has their own profile page. The problem is that authors write multiple articles, so I end up scraping the same authors’ profiles over and over as my spiders crawl the site.

How would I cache the author profiles with Scrapy?


1 Answer

Editorial Team · Answer 1 · 2026-06-17T20:50:36+00:00

Add a duplicates filter as an item pipeline, as in the following example:

from scrapy.exceptions import DropItem


class DuplicatesPipeline:
    """Drop items whose author_id has already been seen during this crawl."""

    def __init__(self):
        self.author_ids_seen = set()

    def process_item(self, item, spider):
        if item['author_id'] in self.author_ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        # First time this author appears: remember the ID and keep the item.
        self.author_ids_seen.add(item['author_id'])
        return item

Then activate DuplicatesPipeline in the ITEM_PIPELINES setting:

# Current Scrapy versions expect a dict; the value (0-1000) sets the run order, lower first.
ITEM_PIPELINES = {
    'myproject.pipeline.DuplicatesPipeline': 300,
}
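
To see how the pipeline above treats repeated authors, here is a minimal sketch that feeds a few items through it by hand. The sample items, the `run_pipeline` helper, and the fallback `DropItem` class are assumptions added for illustration; in a real project Scrapy calls `process_item` for you.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so this sketch also runs without Scrapy installed (illustration only).
    class DropItem(Exception):
        pass


class DuplicatesPipeline:
    """Drop items whose author_id has already been seen during this crawl."""

    def __init__(self):
        self.author_ids_seen = set()

    def process_item(self, item, spider):
        if item['author_id'] in self.author_ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.author_ids_seen.add(item['author_id'])
        return item


def run_pipeline(items):
    """Hypothetical helper: push items through the pipeline, return (kept, dropped)."""
    pipeline = DuplicatesPipeline()
    kept, dropped = [], []
    for item in items:
        try:
            kept.append(pipeline.process_item(item, spider=None))
        except DropItem:
            dropped.append(item)
    return kept, dropped


# Hypothetical sample data: Author 1 shows up on two different articles.
sample_items = [
    {'author_id': 1, 'name': 'Author 1'},
    {'author_id': 2, 'name': 'Author 2'},
    {'author_id': 1, 'name': 'Author 1'},
]
kept, dropped = run_pipeline(sample_items)
print([item['author_id'] for item in kept])     # [1, 2]
print([item['author_id'] for item in dropped])  # [1]
```

Note that a pipeline like this only discards the duplicate item after the profile page has been downloaded. Scrapy's default duplicate request filter already skips repeated requests to the same URL within a crawl, unless a request is created with dont_filter=True.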
