I need to create a user-configurable web spider/crawler, and I'm thinking about using Scrapy. But I can't hard-code the domains and allowed URL regexes; these will instead be configurable in a GUI.
How do I (as simply as possible) create a spider or a set of spiders with Scrapy where the domains and allowed URL regexes are dynamically configurable? E.g. I write the configuration to a file, and the spider reads it somehow.
WARNING: This answer was written for Scrapy v0.7; the spider manager API has changed a lot since then.
Override the default SpiderManager class, load your custom rules from a database or somewhere else, and instantiate a custom spider with your own rules/regexes and domain_name.
in mybot/settings.py:
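A minimal sketch of that settings entry, assuming the 0.7-era `SPIDER_MANAGER_CLASS` setting and the module layout used in this answer:

```python
# Point Scrapy at the custom spider manager defined below
# (module path is an assumption based on this answer's layout).
SPIDER_MANAGER_CLASS = 'mybot.spidermanager.MySpiderManager'
```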
in mybot/spidermanager.py:
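A sketch of the manager itself. The `fromdomain` hook matches the 0.7-era spider manager API referenced in the notes below; `MySpiderManager` and the `_get_spider_info` helper are illustrative names, and the backend lookup is left as a stub:

```python
from mybot.spider import MyParametrizedSpider

class MySpiderManager(object):
    loaded = True

    def fromdomain(self, name):
        # `name` is the key passed on the command line; look up the
        # matching configuration and build a spider from it.
        start_urls, extra_domain_names, regexes = self._get_spider_info(name)
        return MyParametrizedSpider(name, start_urls, extra_domain_names, regexes)

    def close_spider(self, spider):
        # Hook for any cleanup you want to run when a spider closes.
        pass

    def _get_spider_info(self, name):
        # Stub: query your backend (a database, or the config file the
        # question mentions) using `name` as the key, and return the
        # start URLs, allowed domains, and URL regexes for that spider.
        raise NotImplementedError
```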
and now your custom spider class, in mybot/spider.py:
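A sketch of the parametrized spider, assuming the 0.7-era `BaseSpider` with its `domain_name`/`extra_domain_names` attributes; the `parse` body is left to your own crawl logic:

```python
import re

from scrapy.spider import BaseSpider

class MyParametrizedSpider(BaseSpider):

    def __init__(self, name, start_urls, extra_domain_names, regexes):
        self.domain_name = name
        self.start_urls = start_urls
        self.extra_domain_names = extra_domain_names
        # Precompile the allowed-URL patterns loaded from the backend.
        self.regexes = [re.compile(r) for r in regexes]

    def parse(self, response):
        # Follow only links matching one of the configured regexes,
        # extract items, etc. -- whatever your crawl requires.
        pass
```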
Notes:
./scrapy-ctl.py crawl <name>, where `name` is passed to SpiderManager.fromdomain and is the key used to retrieve more spider info from the backend system.
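For example, with a backend entry keyed example.com (a hypothetical name), the crawl would be started as:

```
./scrapy-ctl.py crawl example.com
```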