I’m having two main problems:
1) The parse_item method is not being called/executed after crawling a page.
2) When callback='self.parse_item' is included in the rules, Scrapy does not continue to follow the links. Instead, it only follows the links immediately available from the start_urls.
Here is the code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from sheprime.items import SheprimeItem

class HerroomSpider(CrawlSpider):
    name = "herroom"
    allowed_domains = ["herroom.com"]
    start_urls = [
        "http://www.herroom.com/simone-perele-12p314-trocadero-sheer-seamless-racerback-bra.shtml",
        "http://www.herroom.com/hosiery.aspx",
    ]
    rules = [
        Rule(SgmlLinkExtractor(allow=(r'/[A-Za-z0-9\-]+\.shtml', )), callback='self.parse_item')
    ]

    def parse_item(self, response):
        print "some message"
        # I have put in this simple parse function, because I just want to get it to work
Thanks for your help,
L
Your code: callback='self.parse_item'
It should be: callback='parse_item'
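Why the 'self.' prefix matters can be sketched without Scrapy at all. As far as I can tell, CrawlSpider resolves a string callback by looking up an attribute of that name on the spider (something like getattr(spider, name, None)), and a Rule that has a callback but no explicit follow defaults to not following links. The snippet below only imitates those two behaviours: FakeSpider, resolve_callback and default_follow are hypothetical stand-ins written for illustration, not Scrapy's actual code.

```python
class FakeSpider(object):
    """Stand-in for a spider; not part of Scrapy."""

    def parse_item(self, response):
        return "parsed %s" % response


def resolve_callback(spider, callback):
    """Imitate how a string callback is assumed to be resolved."""
    if callable(callback):
        return callback
    # A string is treated as an attribute name on the spider instance.
    return getattr(spider, callback, None)


def default_follow(callback, follow=None):
    """Imitate the assumed default: follow links only when no callback is set."""
    if follow is None:
        return not callback
    return follow


spider = FakeSpider()

# 'self.parse_item' is not an attribute name, so the lookup finds nothing.
broken = resolve_callback(spider, 'self.parse_item')
# The bare method name resolves to the bound method.
fixed = resolve_callback(spider, 'parse_item')

print(broken is None)
print(fixed('page'))
print(default_follow('parse_item'))
print(default_follow('parse_item', follow=True))
```

If that reading is right, it would account for both problems: the callback string resolves to nothing, so parse_item never runs, and the mere presence of a callback turns link following off. Passing callback='parse_item' together with follow=True in the Rule should address both.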
This works for me:
Results: