I need to create a spider that crawls some data from a website.
Part of that data is an external URL.
I have already created the spider that crawls the data from the root site, and now I want to write a spider for the external web pages.
I was thinking of creating a CrawlSpider that uses SgmlLinkExtractor to follow specific links on each external web page.
What is the recommended way to pass the list of start URLs to the second spider?
My idea is to export the items to a JSON file and read the URL attribute from that file in start_requests of the second spider.
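To make the JSON idea concrete, here is a rough sketch of the reading side. It assumes the first spider's items were exported as a single JSON array and that each item stores the link under a field called external_url. Both the helper name and the field name are placeholders, not anything Scrapy defines.

```python
import json


def external_urls_from_items(path, field="external_url"):
    """Return the non-empty `field` values from a JSON array of items.

    `field` is a hypothetical item attribute; use whatever name the
    first spider's items actually carry.
    """
    with open(path, encoding="utf-8") as f:
        items = json.load(f)  # expects one JSON array of item dicts
    # Skip items that lack the field or have an empty value.
    return [item[field] for item in items if item.get(field)]
```

The second spider's start_requests would then loop over the returned list and create one request per URL.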
Save these external page URLs to a database.
Then override BaseSpider.start_requests in your other spider and create requests from the URLs you read back from the database.
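A minimal sketch of the database hand-off, using SQLite from the standard library only because it needs no setup; any database would do. The table name external_urls and the helpers save_urls and load_urls are made up for this example.

```python
import sqlite3


def save_urls(conn, urls):
    """Store URLs found by the first spider, silently skipping duplicates."""
    # Hypothetical table; the primary key keeps each URL unique.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS external_urls (url TEXT PRIMARY KEY)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO external_urls (url) VALUES (?)",
        [(u,) for u in urls],
    )
    conn.commit()


def load_urls(conn):
    """Return the stored URLs in insertion order."""
    rows = conn.execute("SELECT url FROM external_urls ORDER BY rowid")
    return [row[0] for row in rows]
```

The first spider (or an item pipeline attached to it) would call save_urls, and the overridden start_requests in the second spider would iterate over load_urls(conn) and build a request for each URL.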