The website I am scraping uses JavaScript to set a cookie, and the backend checks that cookie to make sure JS is enabled. Extracting the cookie from the HTML is simple enough, but setting it seems to be a problem in Scrapy. My code is:
import re

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request

class TestSpider(InitSpider):
    ...
    rules = (Rule(SgmlLinkExtractor(allow=(r'products/./index\.html', )), callback='parse_page'),)

    def init_request(self):
        # First request: fetch the page whose JavaScript sets the cookie.
        return Request(url=self.init_url, callback=self.parse_js)

    def parse_js(self, response):
        # Pull the cookie name and value out of the setCookie('name', 'value', ...) call.
        match = re.search(r"setCookie\('(.+?)',\s*?'(.+?)',", response.body, re.M)
        if match:
            cookie = match.group(1)
            value = match.group(2)
        else:
            raise BaseException("Did not find the cookie", response.body)
        return Request(url=self.test_page, callback=self.check_test_page, cookies={cookie: value})

    def check_test_page(self, response):
        if 'Welcome' in response.body:
            self.initialized()

    def parse_page(self, response):
        scraping....
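The cookie extraction step in parse_js can be tried on its own, outside Scrapy. The following is only a minimal sketch: the helper name `extract_js_cookie` and the sample page fragment are made up for illustration, and it assumes the page calls `setCookie('name', 'value', ...)` with single-quoted arguments, as the regex above expects. On Scrapy versions running under Python 3, `response.body` is bytes, so the decoded `response.text` would presumably be passed instead.

```python
import re

# Same pattern as in parse_js, written as a raw string.
SET_COOKIE_RE = re.compile(r"setCookie\('(.+?)',\s*?'(.+?)',", re.M)


def extract_js_cookie(html):
    """Return {name: value} for the first setCookie('name', 'value', ...) call."""
    match = SET_COOKIE_RE.search(html)
    if match is None:
        raise ValueError("Did not find the cookie")
    return {match.group(1): match.group(2)}


# Hypothetical page fragment, made up for illustration.
sample_html = "<script>setCookie('js_enabled', 'abc123', 30);</script>"
cookies = extract_js_cookie(sample_html)
```

The returned dict has the shape that the `cookies=` argument of `Request` accepts, so it could be passed along with each request if the cookie has to be added manually.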
I can see that the content is available in check_test_page, so the cookie works perfectly. But the spider never even gets to parse_page, since CrawlSpider doesn’t see any links without the right cookie. Is there a way to set a cookie for the duration of the scraping session? Or do I have to use BaseSpider and add the cookie to every request manually?
A less desirable alternative would be to set the cookie (the value never seems to change) through the Scrapy configuration files somehow. Is that possible?
It turned out that InitSpider is a BaseSpider, not a CrawlSpider, so the rules are never applied. So it looks like (1) there’s no way to use CrawlSpider in this situation, and (2) there’s no way to set a sticky cookie.
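For the configuration-based alternative, one possible direction is a small downloader middleware that adds the cookie to every outgoing request. This is an untested sketch, not a confirmed solution: the class name `StickyCookieMiddleware`, the module path `myproject.middlewares`, and the cookie name and value are all hypothetical. It assumes `request.cookies` is a plain dict and that the middleware is ordered before Scrapy's built-in CookiesMiddleware (order 700 in the default settings) so the added cookie is picked up.

```python
class StickyCookieMiddleware(object):
    """Hypothetical downloader middleware that adds fixed cookies to every request."""

    # Placeholder cookie; the real name and value come from the site's setCookie call.
    cookies = {"js_enabled": "abc123"}

    def process_request(self, request, spider):
        # Assumes request.cookies is a dict (Scrapy also accepts a list of dicts,
        # which this sketch leaves untouched).
        if isinstance(request.cookies, dict):
            for name, value in self.cookies.items():
                # Do not overwrite a cookie the spider set explicitly.
                request.cookies.setdefault(name, value)
        # Returning None lets the request continue through the remaining middlewares.
        return None


# Hypothetical settings.py entry. The module path is made up; 650 is chosen so
# that this runs before the built-in CookiesMiddleware.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.StickyCookieMiddleware": 650,
}
```

Since the middleware acts on every request regardless of the spider class, it would in principle also work with a CrawlSpider, as long as the cookie value really is constant.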