I’m trying to scrape a Yahoo! Group, and I can get data from one page but nothing beyond that. I’ve got some basic rules, but clearly they aren’t right. Has anyone already solved this one?

    class YgroupSpider(CrawlSpider):
        name = "yahoo.com"
        allowed_domains = ["launch.groups.yahoo.com"]
        start_urls = [
            "http://launch.groups.yahoo.com/group/random_public_ygroup/post"
        ]

        rules = (
            Rule(SgmlLinkExtractor(allow=('message', 'messages'), deny=('mygroups',))),
            Rule(SgmlLinkExtractor(), callback='parse_item'),
        )

        def parse_item(self, response):
            hxs = HtmlXPathSelector(response)
            sites = hxs.select('/html')
            item = Item()
            for site in sites:
                item = YgroupItem()
                item['title'] = site.select('//title').extract()
                item['pubDate'] = site.select('//abbr[@class="updated"]/text()').extract()
                item['desc'] = site.select("//div[contains(concat(' ',normalize-space(@class),' '),' entry-content ')]/text()").extract()
            return item

Looks like you have almost no idea what you are doing. I’m pretty new to Scrapy, but I think you will want something like

    Rule(SgmlLinkExtractor(allow=('http\://example\.com/message/.*\.aspx', )), callback='parse_item'),

Try to write a regular expression that matches the complete link URL you want. Also, it looks like you only need one rule: add the callback to the first one. The link extractor takes every link that matches a regular expression in allow, drops those that also match deny, and each remaining page is then loaded and passed to parse_item.

I’m saying all this without really knowing anything about the page you are data mining or the nature of the data you want. This sort of spider suits a page that links to the pages holding the data you want.
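
If it helps, here is a rough way to check allow and deny patterns before running a crawl. It is only a standard-library sketch, not Scrapy itself: it assumes the link extractor applies each pattern with regular-expression search semantics, and `filter_links`, the sample URLs, and the stricter pattern are all made up for illustration (I don't know what the real message URLs look like).

```python
import re


def filter_links(urls, allow=(), deny=()):
    """Keep URLs that match any allow pattern and no deny pattern.

    Hypothetical helper that imitates the allow/deny filtering described
    above; it is not part of Scrapy.
    """
    allow_res = [re.compile(p) for p in allow]
    deny_res = [re.compile(p) for p in deny]
    kept = []
    for url in urls:
        # An empty allow list is treated as "allow everything".
        if allow_res and not any(r.search(url) for r in allow_res):
            continue
        # Anything matching a deny pattern is dropped, even if allowed.
        if any(r.search(url) for r in deny_res):
            continue
        kept.append(url)
    return kept


# Made-up links standing in for what a group page might contain.
sample_urls = [
    "http://launch.groups.yahoo.com/group/random_public_ygroup/message/1",
    "http://launch.groups.yahoo.com/group/random_public_ygroup/messages/30",
    "http://launch.groups.yahoo.com/mygroups",
    "http://launch.groups.yahoo.com/group/random_public_ygroup/members",
]

# Loose patterns from the question: "message" also matches "messages".
loose = filter_links(sample_urls, allow=("message", "messages"), deny=("mygroups",))

# A tighter, assumed pattern that only accepts individual message pages.
strict = filter_links(
    sample_urls,
    allow=(r"/group/[^/]+/message/\d+$",),
    deny=("mygroups",),
)

print(loose)
print(strict)
```

With the loose patterns, both the single-message page and the message index survive the filter; the tighter pattern keeps only the single-message page, which is the kind of link you would want handed to parse_item.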