I’m using Scrapy, in particular Scrapy’s CrawlSpider class, to scrape web links that contain certain keywords. I have a fairly long start_urls list whose entries come from a SQLite database connected to a Django project. I want to save the scraped web links in this database.
I have two Django models, one for the start urls such as http://example.com and one for the scraped web links such as http://example.com/website1, http://example.com/website2 etc. All scraped web links are subsites of one of the start urls in the start_urls list.
The web links model has a many-to-one relation to the start url model, i.e. the web links model has a ForeignKey to the start urls model. In order to save my scraped web links properly to the database, I need to tell the CrawlSpider’s parse_item() method which start url the scraped web link belongs to. How can I do that? Scrapy’s DjangoItem class does not help in this respect, as I still have to define the used start url explicitly.
In other words, how can I pass the currently used start url to the parse_item() method, so that I can save it together with the appropriate scraped web links to the database? Any ideas? Thanks in advance!
By default you cannot access the original start url.
But you can override the make_requests_from_url method and put the start url into the request’s meta. Then in the parse callback you can extract it from there (if you yield subsequent requests in that parse method, don’t forget to forward the start url in them).
I haven’t worked with CrawlSpider, and maybe what Maxim suggests will work for you, but keep in mind that response.url holds the url after any redirects.
Here is an example of how I would do it, but it’s just an example (taken from the Scrapy tutorial) and was not tested:
Ask if you have any questions. BTW, using PyDev’s ‘Go to definition’ feature you can see the Scrapy sources and understand what parameters Request, make_requests_from_url and other classes and methods expect. Getting into the code helps and saves you time, though it might seem difficult at the beginning.