I’m having a problem iterating a crawl using scrapy. I am extracting a title

Question

Asked: June 1, 2026 (2026-06-01T14:33:02+00:00)

I’m having a problem iterating a crawl using scrapy. I am extracting a title field and a content field. The problem is that I get a JSON file with all of the titles listed first and then all of the content. I’d like to get {title}, {content}, {title}, {content}, which probably means I have to loop inside the parse function, but I cannot figure out what element I should be looping over (i.e., for x in [???]). Here is the code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import SitemapSpider

from Foo.items import FooItem


class FooSpider(SitemapSpider):
    name = "foo"
    sitemap_urls = ['http://www.foo.com/sitemap.xml']
    #sitemap_rules = [


    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        item = FooItem()
        item['title'] = hxs.select('//span[@class="headline"]/text()').extract()
        item['content'] = hxs.select('//div[@class="articletext"]/text()').extract()
        items.append(item)
        return items

1 Answer

Editorial Team · Answer 1 · 2026-06-01T14:33:03+00:00

Your XPath queries return every title and every content block on the page, as two separate lists. I suppose you could pair them up like this:

titles = hxs.select('//span[@class="headline"]/text()').extract()
contents = hxs.select('//div[@class="articletext"]/text()').extract()

for title, content in zip(titles, contents):
    item = FooItem()
    item['title'] = title
    item['content'] = content
    yield item

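But this is not reliable: if any article is missing a title or a body, the two lists go out of step and the pairs no longer match. Try instead to write an XPath query that returns the block containing both the title and the content, then query inside each block. If you post the page source, I can help with the exact expressions.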

blocks = hxs.select('//div[@class="some_filter"]')
for block in blocks:
    item = FooItem()
    item['title'] = block.select('span[@class="headline"]/text()').extract()
    item['content'] = block.select('div[@class="articletext"]/text()').extract()
    yield item

I’m not sure about the exact XPath queries, but I think the idea is clear.
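
To make the difference concrete, here is a small sketch that runs without Scrapy, using the standard library's xml.etree.ElementTree in place of the Scrapy selector. The markup is made up for illustration: the "article" wrapper class is an assumption (your page will use something else), and the second article deliberately has no headline so you can see what zip does with it.

```python
import xml.etree.ElementTree as ET

# Hypothetical page markup. The "article" wrapper class is an assumption,
# and the second article intentionally has no headline.
PAGE = """
<html><body>
  <div class="article">
    <span class="headline">First title</span>
    <div class="articletext">First body</div>
  </div>
  <div class="article">
    <div class="articletext">Second body</div>
  </div>
  <div class="article">
    <span class="headline">Third title</span>
    <div class="articletext">Third body</div>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE)


def pair_with_zip(root):
    """Pair two page-wide lists by position, like the zip snippet."""
    titles = [e.text for e in root.findall(".//span[@class='headline']")]
    contents = [e.text for e in root.findall(".//div[@class='articletext']")]
    return [{"title": t, "content": c} for t, c in zip(titles, contents)]


def pair_by_block(root):
    """Query inside each article block, like the block snippet."""
    items = []
    for block in root.findall(".//div[@class='article']"):
        items.append({
            # findtext returns None when the block has no matching child
            "title": block.findtext("span[@class='headline']"),
            "content": block.findtext("div[@class='articletext']"),
        })
    return items


zipped = pair_with_zip(root)
blocked = pair_by_block(root)

for item in zipped:
    print("zip:  ", item)
for item in blocked:
    print("block:", item)
```

With this sample, the zip version attaches "Third title" to "Second body" and silently drops the third article, while the per-block version keeps all three articles and leaves the missing title empty. Note that the inner queries are relative to the block (no leading //), which is what keeps each title with its own content.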
