I recently discovered Scrapy, which I find very efficient. However, I really don’t see how to embed it in a larger project written in Python. I would like to create a spider in the normal way but be able to launch it on a given URL with a function
start_crawl(url)
which would launch the crawling process on a given domain and stop only when all the pages have been seen.
Scrapy is more complicated than that. It is built on the Twisted event loop and does its work asynchronously, so it does not behave like a normal Python function. Of course you can import the function that starts the crawler and call it, but what then? You will have a normal Scrapy process that has taken control of your program.
Probably the best approach here is to run Scrapy as a subprocess of your program and communicate with it through a database or a file. That gives you good separation between your program and the crawler, and solid control over the main process.
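A minimal sketch of the subprocess approach, under some assumptions that are not in the question: the spider is registered under the hypothetical name "myspider", it accepts a `url` spider argument and uses it as its start URL, and `project_dir` points at the Scrapy project folder (the one containing scrapy.cfg). The scraped items are exchanged through a temporary JSON file written with Scrapy's `-o` feed export option.

```python
import json
import subprocess
import tempfile
from pathlib import Path


def build_crawl_command(spider_name, url, output_path):
    """Build the argument list for a `scrapy crawl` run that exports items to a file."""
    return [
        "scrapy", "crawl", spider_name,
        "-a", f"url={url}",       # spider argument; the spider is assumed to read it
        "-o", str(output_path),   # feed export; format is inferred from the .json extension
    ]


def start_crawl(url, spider_name="myspider", project_dir="."):
    """Run the crawl in a child process, wait for it to finish, return the scraped items.

    "myspider" and project_dir are placeholders for your own project.
    """
    with tempfile.TemporaryDirectory() as tmp:
        output_path = Path(tmp) / "items.json"
        # Blocks until the crawl is done; raises CalledProcessError on a non-zero exit code.
        subprocess.run(
            build_crawl_command(spider_name, url, output_path),
            cwd=project_dir,
            check=True,
        )
        # If the spider yielded nothing, the file may be missing or empty.
        if not output_path.exists() or output_path.stat().st_size == 0:
            return []
        return json.loads(output_path.read_text(encoding="utf-8"))
```

Calling `start_crawl("http://example.com")` from the main program then blocks until the crawler exits and returns a list of dicts, one per scraped item. Because the crawl lives in its own process, it can be started as many times as needed without interfering with the rest of the program.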