I have a CrawlSpider, the code is below. I use Tor through tsocks. When

Question

0

Asked: June 18, 20262026-06-18T02:17:42+00:00 2026-06-18T02:17:42+00:00

I have a CrawlSpider, the code is below. I use Tor through tsocks. When

0

I have a CrawlSpider, the code is below. I use Tor through tsocks.
When I start my spider, everything works fine. Using init_request I can login on site and crawl with sticky cookies.

But problem occurred when I stopped and resumed spider. Cookies became not sticky.

I give you the response from Scrapy.

=======================INIT_REQUEST================
2013-01-30 03:03:58+0300 [my] INFO: Spider opened
2013-01-30 03:03:58+0300 [my] INFO: Resuming crawl (675 requests scheduled)
............ And here crawling began

So… callback=self.login_url in def init_request is not fired!!!

I thought that scrapy engine didn’t want to send again request on login page. Before resuming scrapy I changed login_page (I can login from every page on site) to different that not included in restrict_xpaths.

Result is – After resuming I cannot login and previous cookies are lost.

Does anyone have some assumptions?

from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import TakeFirst, MapCompose, Join, Identity
from beles_com_ua.items import Product
from scrapy.utils.response import get_base_url
from scrapy.utils.url import urljoin_rfc
from scrapy.utils.markup import remove_entities
from django.utils.html import strip_tags
from datetime import datetime
from scrapy import log
import re
from scrapy.http import Request, FormRequest

class ProductLoader(XPathItemLoader):
    .... some code is here ...


class MySpider(CrawlSpider):
    name = 'my'
    login_page = 'http://test.com/index.php?section=6&type=12'

    allowed_domains = ['test.com']
    start_urls = [
        'http://test.com/index.php?section=142',
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow=('.',),restrict_xpaths=('...my xpath...')),callback='parse_item', follow=True),
    )
    def start_requests(self):
        return self.init_request()

    def init_request(self):
        print '=======================INIT_REQUEST================'
        return [Request(self.login_page, callback=self.login_url)]


    def login_url(self, response):
        print '=======================LOGIN======================='
        """Generate a login request."""
        return FormRequest.from_response(response,
            formdata={'login': 'mylogin', 'pswd': 'mypass'},
            callback=self.after_login)

    def after_login(self, response):
        print '=======================AFTER_LOGIN ...======================='
        if "images/info_enter.png" in response.body:
               print "==============Bad times :(==============="
        else:
           print "=========Successfully logged in.========="
           for url in self.start_urls:
            yield self.make_requests_from_url(url)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)

        entry = hxs.select("//div[@class='price']/text()").extract()
        l = ProductLoader(Product(), hxs)
        if entry:
        name = hxs.select("//div[@class='header_box']/text()").extract()[0]
        l.add_value('name', name)
        ... some code is here ...
        return l.load_item()

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-18T02:17:43+00:00

The init_request(self): is available only when you subclass from InitSpider not CrawlSpider

You need to subclass your spider from InitSpider like this

class WorkingSpider(InitSpider):

    login_page = 'http://www.example.org/login.php'
    def init_request(self):
        #"""This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

But then remember that you can’t define Rules in initSpider as its only avaiable in CrawlSpider you need to manually extract the links

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I have a CrawlSpider, the code is below. I use Tor through tsocks. When

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply