I’m scraping a website with Scrapy and would like to split the results into two parts. Usually I call Scrapy like this:
$ scrapy crawl articles -o articles.json
$ scrapy crawl authors -o authors.json
The two spiders are completely independent and don’t communicate at all. This setup works for smaller websites, but larger websites have far too many authors for me to crawl them all this way.
How would I have the articles spider tell the authors spider what pages to crawl and maintain this two-file structure? Ideally, I’d rather not write the author URLs to a file and then read it back with the other spider.
I ended up using command line arguments for the author scraper:
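A minimal sketch of what such a spider might look like, assuming the article items carry an `author_url` field and the input path arrives as a spider argument named `articles_file`. Both names, the helper `read_author_urls`, and the `h1::text` selector are hypothetical. The guarded import only lets the helper run where Scrapy is not installed. Recent Scrapy releases prefer an async `start()` over `start_requests()`, so this may need adapting.

```python
import json

try:
    import scrapy
except ImportError:  # lets the helper below run without Scrapy installed
    scrapy = None


def read_author_urls(path, field="author_url"):
    """Yield the author URL from each record of a JSON lines file.

    `field` is an assumed key name; adjust it to match the article items.
    """
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            url = json.loads(line).get(field)
            if url:
                yield url


if scrapy is not None:

    class AuthorsSpider(scrapy.Spider):
        name = "authors"

        def __init__(self, articles_file=None, *args, **kwargs):
            # `articles_file` is a hypothetical argument supplied with -a.
            super().__init__(*args, **kwargs)
            if articles_file is None:
                raise ValueError("pass -a articles_file=<path>")
            self.articles_file = articles_file

        def start_requests(self):
            # One request per author URL found in the articles output.
            for url in read_author_urls(self.articles_file):
                yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            # Placeholder extraction; the real selectors depend on the site.
            yield {"url": response.url, "name": response.css("h1::text").get()}
```

Scrapy hands each `-a name=value` pair to the spider's constructor as a keyword argument, which is how `articles_file` would reach `__init__` here.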
Then, I added the duplicates pipeline outlined in the Scrapy documentation:
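A sketch along the lines of the duplicates filter in the Scrapy item pipeline documentation, adapted under the assumption that author items are plain dicts keyed by a `url` field (the documentation's example keys on an `id` field instead). The fallback `DropItem` class is only a stand-in so the snippet runs without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:  # stand-in so the sketch runs without Scrapy installed
    class DropItem(Exception):
        pass


class DuplicatesPipeline:
    """Drop any author item whose URL has already been seen in this crawl."""

    def __init__(self):
        self.urls_seen = set()

    def process_item(self, item, spider):
        url = item["url"]  # assumed unique key for an author
        if url in self.urls_seen:
            raise DropItem(f"Duplicate author: {url}")
        self.urls_seen.add(url)
        return item
```

The pipeline would be enabled through the `ITEM_PIPELINES` setting, for example `{"myproject.pipelines.DuplicatesPipeline": 300}`, where `myproject` is a placeholder for the actual project package.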
Finally, I passed the article JSON lines file into the command:
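As a rough illustration, the two crawls could be chained from a small Python driver. The file name `articles.jl` and the argument name `articles_file` are assumptions; only the spider names `articles` and `authors` come from the commands above. The `.jl` extension makes Scrapy export JSON lines.

```python
import subprocess


def build_commands(articles_path="articles.jl", authors_path="authors.json"):
    """Return the two crawl commands in the order they must run."""
    return [
        # Step 1: crawl articles and export them as JSON lines.
        ["scrapy", "crawl", "articles", "-o", articles_path],
        # Step 2: hand that file to the authors spider via a spider argument
        # (`articles_file` is a hypothetical argument name).
        ["scrapy", "crawl", "authors",
         "-a", f"articles_file={articles_path}",
         "-o", authors_path],
    ]


def run_crawls(articles_path="articles.jl", authors_path="authors.json"):
    """Run both crawls sequentially, stopping if the first one fails."""
    for cmd in build_commands(articles_path, authors_path):
        subprocess.run(cmd, check=True)
```

Calling `run_crawls()` from the project directory would be equivalent to typing the two `scrapy crawl` commands one after the other.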
It’s not a great solution, but it works.