Trying to scrape a Y! Group and I can get data from one page

Asked: May 20, 2026

Trying to scrape a Y! Group and I can get data from one page but that’s it. I’ve got some basic rules but clearly they aren’t right. Anyone already solved this one?

class YgroupSpider(CrawlSpider):
    name = "yahoo.com"
    allowed_domains = ["launch.groups.yahoo.com"]
    start_urls = [
        "http://launch.groups.yahoo.com/group/random_public_ygroup/post"
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=('message', 'messages'), deny=('mygroups', ))),
        Rule(SgmlLinkExtractor(), callback='parse_item'),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('/html')
        item = Item()
        for site in sites:
            item = YgroupItem()
            item['title'] = site.select('//title').extract()
            item['pubDate'] = site.select('//abbr[@class="updated"]/text()').extract()
            item['desc'] = site.select("//div[contains(concat(' ',normalize-space(@class),' '),' entry-content ')]/text()").extract()
        return item

1 Answer

Editorial Team · Answer 1 · 2026-05-20T22:01:37+00:00

I’m pretty new to Scrapy, but the rules look like the problem. I think you will want something like
Rule(SgmlLinkExtractor(allow=(r'http://example\.com/message/.*\.aspx', )), callback='parse_item'),
Try to write a regular expression that matches the complete URL of the links you want. It also looks like you only need one rule: add the callback to the first one. The link extractor keeps every link that matches a regular expression in allow, drops any of those that also match deny, and each remaining page is then loaded and passed to parse_item. Note that a rule with a callback does not follow links found on the pages it matches unless you also pass follow=True.
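To make the allow/deny behaviour concrete, here is a minimal sketch in plain Python using only the re module. It is not Scrapy's own code; filter_links is a hypothetical helper, the URLs are made up for illustration, and it assumes patterns are applied with a regex search against the full URL.

```python
import re


def filter_links(urls, allow=(), deny=()):
    """Keep URLs that match any allow pattern and no deny pattern.

    Hypothetical helper that imitates the allow/deny idea; an empty
    allow tuple is treated as "allow everything".
    """
    allow_res = [re.compile(p) for p in allow]
    deny_res = [re.compile(p) for p in deny]
    kept = []
    for url in urls:
        # Skip URLs that match none of the allow patterns.
        if allow_res and not any(r.search(url) for r in allow_res):
            continue
        # Skip URLs that match any deny pattern.
        if any(r.search(url) for r in deny_res):
            continue
        kept.append(url)
    return kept


# Made-up URLs standing in for links found on the start page.
candidate_urls = [
    "http://launch.groups.yahoo.com/group/random_public_ygroup/message/1",
    "http://launch.groups.yahoo.com/group/random_public_ygroup/messages/30",
    "http://launch.groups.yahoo.com/mygroups",
    "http://launch.groups.yahoo.com/group/random_public_ygroup/files",
]

# A loose pattern such as 'message' also matches the 'messages' index pages.
loose = filter_links(candidate_urls, allow=("message",), deny=("mygroups",))

# A tighter pattern keeps only individual message pages.
strict = filter_links(candidate_urls, allow=(r"/message/\d+$",), deny=("mygroups",))
```

Under these assumptions, loose keeps the first two URLs and strict keeps only the first, which is why a pattern covering the complete link URL is usually the safer choice.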

I’m saying all this without really knowing anything about the page you are mining or the data you want. This sort of spider suits a page that links to the pages holding the data you want.
