The Archive Base


The Archive Base Latest Questions

Editorial Team
Asked: May 19, 2026


I’m new to web scraping and just started experimenting with Scrapy, a scraping framework written in Python. My goal is to scrape an old Yahoo Group, since Yahoo doesn’t provide an API or any other means of retrieving message archives. The group is configured so that you have to log in before you can view the archives.

The steps I need to accomplish, I think, are:

  1. Log in to Yahoo
  2. Visit the URL for the first message and scrape it
  3. Repeat step 2 for the next message, and so on
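Steps 2 and 3 amount to walking a range of message ids. Here is a minimal sketch of that part only, with no Scrapy involved. The `MSG_URL` template below is an assumption inferred from the URLs in the crawl log (the real constant is not shown), and `message_urls` is a hypothetical helper name.

```python
# Hypothetical URL template, inferred from the crawl log; the spider's real
# MSG_URL constant is not shown in the question.
MSG_URL = "http://launch.groups.yahoo.com/group/MyYahooGroup/message/%d"


def message_urls(first_id, last_id):
    """Yield the archive URL for every message id in [first_id, last_id]."""
    for msg_id in range(first_id, last_id + 1):
        yield MSG_URL % msg_id


# Show the first few URLs the crawl would visit.
for url in message_urls(1, 3):
    print(url)
```

In the spider, each of these URLs would become one request issued after the login succeeds.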

I started roughing out a Scrapy spider to accomplish the above; here is what I have so far. For now, all I want to confirm is that the login works and that I can retrieve the first message. I’ll finish the rest once I get this much working:

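# Imports for the legacy Scrapy API this spider was written against (circa 2011).
from scrapy.spider import BaseSpider
from scrapy.http import FormRequest, Request

# LOGIN_URL, LOGIN, PASSWORD and MSG_URL are module-level constants defined
# elsewhere (not shown).

class Sg101Spider(BaseSpider):
    name = "sg101"
    msg_id = 1              # current message to retrieve
    max_msg_id = 21399      # last message to retrieve

    def start_requests(self):
        # Begin the crawl by posting the login form.
        return [FormRequest(LOGIN_URL,
            formdata={'login': LOGIN, 'passwd': PASSWORD},
            callback=self.logged_in)]

    def logged_in(self, response):
        # A successful login ends up on the my.yahoo.com landing page.
        if response.url == 'http://my.yahoo.com':
            self.log("Successfully logged in. Now requesting 1st message.")
            return Request(MSG_URL % self.msg_id, callback=self.parse_msg,
                    errback=self.error)
        else:
            self.log("Login failed.")

    def parse_msg(self, response):
        # For now, just dump the raw page to confirm the request worked.
        self.log("Got message!")
        print(response.body)

    def error(self, failure):
        self.log("I haz an error")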

When I run the spider, though, I see it log in and issue the request for the first message. However, all the Scrapy debug output shows is three redirects, the last of which points at the URL I asked for in the first place. Scrapy never calls my parse_msg() callback, and the crawl stops. Here is a snippet of the Scrapy output:

2011-02-03 19:50:10-0600 [sg101] INFO: Spider opened
2011-02-03 19:50:10-0600 [sg101] DEBUG: Redirecting (302) to <GET https://login.yahoo.com/config/verify?.done=http%3a//my.yahoo.com> from <POST https://login.yahoo.com/config/login>
2011-02-03 19:50:10-0600 [sg101] DEBUG: Redirecting (meta refresh) to <GET http://my.yahoo.com> from <GET https://login.yahoo.com/config/verify?.done=http%3a//my.yahoo.com>
2011-02-03 19:50:12-0600 [sg101] DEBUG: Crawled (200) <GET http://my.yahoo.com> (referer: None)
2011-02-03 19:50:12-0600 [sg101] DEBUG: Successfully logged in. Now requesting 1st message.
2011-02-03 19:50:12-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?done=http%3A%2F%2Flaunch.groups.yahoo.com%2Fgroup%2FMyYahooGroup%2Fmessage%2F1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/message/1>
2011-02-03 19:50:12-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?check=G&done=http%3A%2F%2Flaunch%2Egroups%2Eyahoo%2Ecom%2Fgroup%2FMyYahooGroup%2Fmessage%2F1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?done=http%3A%2F%2Flaunch.groups.yahoo.com%2Fgroup%2FMyYahooGroup%2Fmessage%2F1>
2011-02-03 19:50:13-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/message/1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?check=G&done=http%3A%2F%2Flaunch%2Egroups%2Eyahoo%2Ecom%2Fgroup%2FMyYahooGroup%2Fmessage%2F1>
2011-02-03 19:50:13-0600 [sg101] INFO: Closing spider (finished)
2011-02-03 19:50:13-0600 [sg101] INFO: Spider closed (finished)

I am unable to make sense of this. It looks like Yahoo is redirecting the spider (maybe for an auth check?), but it ends up back at the URL I wanted to visit in the first place. Even so, Scrapy doesn’t call my callback, so I don’t get a chance to scrape the data or continue crawling.
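One way to check that reading of the log is to extract the redirect hops and compare the first source with the last target. The sketch below uses only the standard library; the three lines in `LOG` are the message-request redirects copied from the output above with the timestamp and spider prefix trimmed, and `redirect_hops` is a hypothetical helper name.

```python
import re

# The three redirect lines for the message request, copied from the debug
# output (timestamp and "[sg101] DEBUG:" prefix trimmed).
LOG = """\
Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?done=http%3A%2F%2Flaunch.groups.yahoo.com%2Fgroup%2FMyYahooGroup%2Fmessage%2F1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/message/1>
Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?check=G&done=http%3A%2F%2Flaunch%2Egroups%2Eyahoo%2Ecom%2Fgroup%2FMyYahooGroup%2Fmessage%2F1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?done=http%3A%2F%2Flaunch.groups.yahoo.com%2Fgroup%2FMyYahooGroup%2Fmessage%2F1>
Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/message/1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?check=G&done=http%3A%2F%2Flaunch%2Egroups%2Eyahoo%2Ecom%2Fgroup%2FMyYahooGroup%2Fmessage%2F1>
"""

# Captures the redirect target first, then the request it came from.
PATTERN = re.compile(
    r"Redirecting \([^)]+\) to <GET ([^>\s]+)> from <(?:GET|POST) ([^>\s]+)>"
)


def redirect_hops(log_text):
    """Return a (source, target) pair for every redirect line in the log."""
    return [(m.group(2), m.group(1)) for m in PATTERN.finditer(log_text)]


hops = redirect_hops(LOG)
print("number of hops:", len(hops))
print("chain ends where it started:", hops[0][0] == hops[-1][1])
```

If the last line prints True, the final redirect target is exactly the URL of the original request, which matches what the log appears to show.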

Does anyone have any ideas on what is happening and/or how to debug this further? Thanks!


1 Answer

  1. Editorial Team
    Added an answer on May 19, 2026 at 10:22 pm

    I think Yahoo is redirecting for an authorization check, and it finally redirects back to the page I originally requested. Scrapy has already seen a request for that URL, however, so it filters the final redirect as a duplicate to avoid getting into a loop, and the crawl ends without any callback being invoked. The solution, in my case, is to pass dont_filter=True to the Request constructor, which instructs Scrapy not to filter out duplicate requests. This is fine in my case because I know in advance which URLs I want to crawl.

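    # Needs `from scrapy import log` for the log levels (legacy Scrapy API).
    def logged_in(self, response):
        if response.url == 'http://my.yahoo.com':
            self.log("Successfully logged in. Now requesting message page.",
                    level=log.INFO)
            # dont_filter=True lets the final redirect back to this same URL
            # through instead of having it dropped as a duplicate.
            return Request(MSG_URL % self.msg_id, callback=self.parse_msg,
                    errback=self.error, dont_filter=True)
        else:
            self.log("Login failed.", level=log.CRITICAL)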
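To illustrate why the flag matters, here is a toy model of the behaviour described above. It is not Scrapy's actual filter: `SeenFilter` and `follow` are hypothetical names, the URLs in `CHAIN` are shortened placeholders for the ones in the log, and the sketch assumes the flag carries over to every redirected request, which is what the fix relies on.

```python
class SeenFilter:
    """Toy stand-in for a duplicate-request filter: remembers scheduled URLs."""

    def __init__(self):
        self.seen = set()

    def allow(self, url, dont_filter=False):
        # Requests flagged dont_filter skip the duplicate check.
        if not dont_filter and url in self.seen:
            return False
        self.seen.add(url)
        return True


def follow(chain, dont_filter=False):
    """Schedule each URL in order; return the ones that were let through."""
    flt = SeenFilter()
    fetched = []
    for url in chain:
        if not flt.allow(url, dont_filter=dont_filter):
            break  # dropped as a duplicate, so the chain stops here
        fetched.append(url)
    return fetched


# Shortened placeholders for the URLs in the log: the message page, two
# auth hops, then the message page again.
CHAIN = ["/message/1", "/auth?done=...", "/auth?check=G&done=...", "/message/1"]

print(len(follow(CHAIN)))                    # 3: the final hop is dropped
print(len(follow(CHAIN, dont_filter=True)))  # 4: the final hop goes through
```

Without the flag, the model stops one hop short of the page, just as the spider does. In current Scrapy releases, setting `DUPEFILTER_DEBUG = True` should make the built-in filter log every request it drops, which may help confirm this kind of problem.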
    
© 2021 The Archive Base. All Rights Reserved