I am trying to do:
    class SpiderSpider(CrawlSpider):
        name = "lolies"
        allowed_domains = ["domain.com"]
        start_urls = ['http://www.domain.com/directory/lol2']
        rules = (Rule(SgmlLinkExtractor(allow=[r'directory/lol2/\w+$']), follow=True), Rule(SgmlLinkExtractor(allow=[r'directory/lol2/\w+/\d+$']), follow=True),Rule(SgmlLinkExtractor(allow=[r'directory/lol2/\d+$']), callback=self.parse_loly))

    def parse_loly(self, response):
        print 'Hi this is the loly page %s' % response.url
        return
That gives me:
    NameError: name 'self' is not defined
If I change the callback to callback="self.parse_loly", the method never seems to be called, so the URL is never printed.
The crawl itself seems to work without problems, though, because I get many Crawled 200 debug messages for that rule.
What could I be doing wrong?
Thanks in advance guys!
It looks like the whitespace for parse_loly is not aligned correctly. Python is whitespace-sensitive, so to the interpreter it looks like a method outside of SpiderSpider. You might also want to split your rules line into shorter lines, as per PEP 8.
Try this:
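Here is a rough, Scrapy-free sketch of the two points that matter: the method has to be indented into the class, and the callback has to be given by name as a string, because self does not exist while the class body is being executed. The Rule and MiniCrawlSpider classes below are hypothetical stand-ins written only for illustration, not Scrapy's real classes, and the getattr lookup merely mimics how a string callback is resolved on the spider. It is written for Python 3, so parse_loly returns its message instead of using the print statement.

```python
# Hypothetical stand-ins for illustration only; these are NOT Scrapy's classes.
class Rule:
    def __init__(self, pattern, callback=None):
        self.pattern = pattern
        self.callback = callback  # a callable, or the *name* of a spider method


class MiniCrawlSpider:
    rules = ()

    def resolve_callback(self, rule):
        """Return the callable for a rule, or None if it cannot be found."""
        if rule.callback is None or callable(rule.callback):
            return rule.callback
        # A string is looked up as an attribute name on the spider instance.
        return getattr(self, rule.callback, None)


class SpiderSpider(MiniCrawlSpider):
    # The class body runs before any instance exists, so `self` is undefined
    # here; the callback is therefore given by name.
    rules = (
        Rule(r'directory/lol2/\d+$', callback='parse_loly'),
        # Wrong on purpose: no attribute is literally named 'self.parse_loly'.
        Rule(r'directory/lol2/\w+$', callback='self.parse_loly'),
    )

    # Indented one level, so it is a method of SpiderSpider.
    def parse_loly(self, url):
        return 'Hi this is the loly page %s' % url


# Referring to `self` inside a class body raises NameError.
try:
    class Broken(MiniCrawlSpider):
        rules = (Rule(r'directory/lol2/\d+$', callback=self.parse_loly),)
except NameError as exc:
    error = str(exc)

spider = SpiderSpider()
good = spider.resolve_callback(spider.rules[0])  # bound method parse_loly
bad = spider.resolve_callback(spider.rules[1])   # None, so nothing is ever called
print(good('http://www.domain.com/directory/lol2/1'))
print(bad)
print(error)
```

In the actual spider the equivalent change would be callback='parse_loly' in the last Rule, with def parse_loly indented to the same level as rules. The 'self.parse_loly' string fails silently for the same reason as in the sketch: no attribute with that exact name exists, so the pages are crawled but no callback runs.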