I’m scraping a website with Scrapy and would like to split the results into

Question

Asked: June 18, 2026 (2026-06-18T10:09:50+00:00)

I’m scraping a website with Scrapy and would like to split the results into two parts. Usually I call Scrapy like this:

$ scrapy crawl articles -o articles.json
$ scrapy crawl authors  -o  authors.json

The two spiders are completely independent and don’t communicate at all. This setup works for smaller websites, but larger websites have just too many authors for me to crawl like this.

How would I have the articles spider tell the authors spider what pages to crawl and maintain this two-file structure? Ideally, I’d rather not write the author URLs to a file and then read it back with the other spider.

1 Answer

Editorial Team · Answer 1 · 2026-06-18T10:09:52+00:00

I ended up using command line arguments for the author scraper:

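import json

import scrapy


class AuthorSpider(scrapy.Spider):
    ...

    def __init__(self, articles, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = []

        # -a articles=<path> arrives as a string, so open the file here
        # and read one JSON object per line.
        with open(articles) as f:
            for line in f:
                article = json.loads(line)
                self.start_urls.append(article['author_url'])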
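Since many articles can share one author, the same author_url may appear several times in the articles file. Here is a minimal sketch of reading the file while dropping repeats, assuming one JSON object per line with an author_url key as in the spider above. The helper name load_author_urls is hypothetical, not part of Scrapy:

```python
import json


def load_author_urls(path):
    """Return the unique author_url values from a JSON lines file, in first-seen order."""
    seen = set()
    urls = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            url = json.loads(line).get("author_url")
            # Ignore records without an author_url and URLs already collected.
            if url and url not in seen:
                seen.add(url)
                urls.append(url)
    return urls
```

In the spider's __init__, the result could be assigned straight to self.start_urls, so each author page appears in the start list only once.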

Then, I added the duplicates pipeline outlined in the Scrapy documentation:

from scrapy.exceptions import DropItem

class DuplicatesPipeline:
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
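The pipeline only takes effect once it is enabled in the project settings. Here is a sketch of the settings.py entry, assuming a project package named myproject (hypothetical) with the class defined in its pipelines.py. The value 300 is an arbitrary choice within the customary 0-1000 ordering range:

```python
# settings.py
# Maps the pipeline's import path to its order; lower values run first.
# "myproject" is a placeholder for the real project package name.
ITEM_PIPELINES = {
    "myproject.pipelines.DuplicatesPipeline": 300,
}
```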

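Finally, I passed the article JSON lines file into the command. The spider parses it line by line, so the file has to hold one JSON object per line rather than a single JSON array: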

$ scrapy crawl authors -o authors.json -a articles=articles.json

It’s not a great solution, but it works.
