I am just trying to crawl a single page:
start_urls = ['https://www.mileageplusshopping.com/shopping/b____alpha.htm']
but it keeps redirecting again and again, and in the end Scrapy discards the request.
I also tried setting
REDIRECT_MAX_TIMES=100
but it still redirects 100 times and then Scrapy discards it.
Any help would be appreciated.
Here is the log:
2012-04-02 20:10:53+0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false> from <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false> from <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false> from <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false> from <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:54+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false> from <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:54+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false>
2012-04-02 20:10:54+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:54+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false> from <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:54+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false>
2012-04-02 20:10:54+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:54+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false> from <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:54+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false>
2012-04-02 20:10:54+0500 [mileageplusshopping] DEBUG: Discarding <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm>: max redirections reached
2012-04-02 20:10:54+0500 [mileageplusshopping] ERROR: Error downloading <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm>:
2012-04-02 20:10:54+0500 [mileageplusshopping] INFO: Closing spider (finished)
I am on Scrapy 0.14.
Here are my settings:
BOT_NAME = 'mall_crawler'
BOT_VERSION = '1.0'
SPIDER_MODULES = ['mall_crawler.spiders']
NEWSPIDER_MODULE = 'mall_crawler.spiders'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)
USER_AGENT = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8'
RANDOMIZE_DOWNLOAD_DELAY = True
DOWNLOAD_DELAY = 1
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
DOWNLOADER_MIDDLEWARES = {
'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': None,
}
SCHEDULER_ORDER = 'BFO'
I found the solution, so I would like to share it.
The problem is caused by the HTTP cache (HTTPCACHE_ENABLED = True in my settings).
The start URL is
https://www.mileageplusshopping.com/shopping/b____alpha.htm
which redirects to
https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false
which redirects to
https://x.www.mileageplusshopping.com/shopping/b____alpha.htm
which finally redirects back to
https://www.mileageplusshopping.com/shopping/b____alpha.htm
The first and the last request are the same, so on the last request Scrapy finds the first response (the 302) in the cache and the loop starts over. If the pages are not cached (HTTPCACHE_ENABLED = False), all is well.
Otherwise, if we want to cache the pages, we need to handle all of this manually.
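
To illustrate why a cached redirect can turn this chain into a loop, here is a rough toy model in plain Python. It is not Scrapy's actual middleware code. ToyServer, crawl and the session_ready flag are hypothetical names, and the server behaviour (the x.www step sets up a session, after which the start URL answers 200) is only my assumption about what the site does.

```python
# Toy model of the redirect chain; NOT Scrapy's real code.
A = "https://www.mileageplusshopping.com/shopping/b____alpha.htm"
B = ("https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx"
     "?target=/shopping/b____alpha.htm&redirect=sec"
     "&targetURLKey=cartua.bridge.url&remove=false")
C = "https://x.www.mileageplusshopping.com/shopping/b____alpha.htm"


class ToyServer:
    """Assumed behaviour: A keeps redirecting until step C has run once."""

    def __init__(self):
        self.session_ready = False

    def fetch(self, url):
        if url == A:
            return (200, None) if self.session_ready else (302, B)
        if url == B:
            return (302, C)
        if url == C:
            # Assumption: this step sets up the session, then sends us back.
            self.session_ready = True
            return (302, A)
        raise ValueError("unknown URL: %s" % url)


def crawl(start_url, use_cache, max_redirects=20):
    """Follow redirects, optionally replaying cached responses."""
    server, cache, url = ToyServer(), {}, start_url
    for hops in range(max_redirects + 1):
        if use_cache and url in cache:
            # The stale 302 stored for this URL is replayed.
            status, location = cache[url]
        else:
            status, location = server.fetch(url)
            if use_cache:
                cache[url] = (status, location)
        if status != 302:
            return "downloaded after %d redirects" % hops
        url = location
    return "discarded: max redirections reached"


print(crawl(A, use_cache=True))   # discarded: max redirections reached
print(crawl(A, use_cache=False))  # downloaded after 3 redirects
```

In this model, raising max_redirects does not help when the cache is on, because the cached 302 for the start URL is replayed on every pass through the chain.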