I am trying to scrape an xml file with the below format file_sample.xml: <rss

Editorial Team

Asked: June 8, 2026, 14:47:31 UTC

I am trying to scrape an XML file with the format below.

file_sample.xml:

<rss version="2.0">
 <channel>
   <item>
       <title>SENIOR BUDGET ANALYST (new)</title>
       <link>https://hr.example.org/psp/hrapp&SeqId=1</link>
       <pubDate>Wed, 18 Jul 2012 04:00:00 GMT</pubDate>
       <category>All Open Jobs</category>
   </item>
   <item>
       <title>BUDGET ANALYST (healthcare)</title>
       <link>https://hr.example.org/psp/hrapp&SeqId=2</link>
       <pubDate>Wed, 18 Jul 2012 04:00:00 GMT</pubDate>
       <category>All category</category>
   </item>
 </channel>
</rss>

Below is my spider.py code:

class TestSpider(XMLFeedSpider):
    name = "testproject"
    allowed_domains = {"www.example.com"}
    start_urls = [
        "https://www.example.com/hrapp/rss/careers_jo_rss.xml"
        ]
    iterator = 'iternodes'
    itertag = 'channel'


    def parse_node(self, response, node):
        title = node.select('item/title/text()').extract()
        link  = node.select('item/link/text()').extract()
        pubdate  = node.select('item/pubDate/text()').extract()
        category  = node.select('item/category/text()').extract()
        item = TestprojectItem()
        item['title'] = title
        item['link'] = link
        item['pubdate'] = pubdate
        item['category'] = category
        return item

Result:

2012-07-25 13:24:14+0530 [testproject] DEBUG: Scraped from <200 https://hr.templehealth.org/hrapp/rss/careers_jo_rss.xml>
    {'title': [u'SENIOR BUDGET ANALYST (hospital/healthcare)',
               u'BUDGET ANALYST'],
     'link': [u'https://hr.example.org/psp/hrapp&SeqId=1',
              u'https://hr.example.org/psp/hrapp&SeqId=2'],
     'pubdate': [u'Wed, 18 Jul 2012 04:00:00 GMT',
                 u'Wed, 18 Jul 2012 04:00:00 GMT'],
     'category': [u'All Open Jobs',
                  u'All category']}

As you can see from the result above, the values from all matching tags are combined into a single list per field. I want them grouped by their individual item tag instead, as shown here, the way we do for HTML scraping:

    {'title': u'SENIOR BUDGET ANALYST (hospital/healthcare)',
     'link': u'https://hr.example.org/psp/hrapp&SeqId=1',
     'pubdate': u'Wed, 18 Jul 2012 04:00:00 GMT',
     'category': u'All Open Jobs'}
    {'title': u'BUDGET ANALYST',
     'link': u'https://hr.example.org/psp/hrapp&SeqId=2',
     'pubdate': u'Wed, 18 Jul 2012 04:00:00 GMT',
     'category': u'All category'}

How can I scrape the XML data so that each item tag produces its own record?

Thanks in advance.

1 Answer

Editorial Team · Answer 1 · 2026-06-08T14:47:32+00:00

I recommend using feedparser, which parses the feed into one entry per item tag:

import feedparser
feedparser.parse(url)

This results in:

{'bozo': 1,
 'bozo_exception': xml.sax._exceptions.SAXParseException("EntityRef: expecting ';'\n"),
 'encoding': u'utf-8',
 'entries': [{'link': u'https://hr.example.org/psp/hrapp&SeqId=1',
   'links': [{'href': u'https://hr.example.org/psp/hrapp&SeqId=1',
     'rel': u'alternate',
     'type': u'text/html'}],
   'tags': [{'label': None, 'scheme': None, 'term': u'All Open Jobs'}],
   'title': u'SENIOR BUDGET ANALYST (new)',
   'title_detail': {'base': u'',
    'language': None,
    'type': u'text/plain',
    'value': u'SENIOR BUDGET ANALYST (new)'},
   'updated': u'Wed, 18 Jul 2012 04:00:00 GMT',
   'updated_parsed': time.struct_time(tm_year=2012, tm_mon=7, tm_mday=18, tm_hour=4, tm_min=0, tm_sec=0, tm_wday=2, tm_yday=200, tm_isdst=0)},
  {'link': u'https://hr.example.org/psp/hrapp&SeqId=2',
   'links': [{'href': u'https://hr.example.org/psp/hrapp&SeqId=2',
     'rel': u'alternate',
     'type': u'text/html'}],
   'tags': [{'label': None, 'scheme': None, 'term': u'All category'}],
   'title': u'BUDGET ANALYST (healthcare)',
   'title_detail': {'base': u'',
    'language': None,
    'type': u'text/plain',
    'value': u'BUDGET ANALYST (healthcare)'},
   'updated': u'Wed, 18 Jul 2012 04:00:00 GMT',
   'updated_parsed': time.struct_time(tm_year=2012, tm_mon=7, tm_mday=18, tm_hour=4, tm_min=0, tm_sec=0, tm_wday=2, tm_yday=200, tm_isdst=0)}],
 'feed': {},
 'namespaces': {},
 'version': u'rss20'}
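
To get from that result to the one-dict-per-item shape asked for in the question, something like the following should work. This is only a sketch: entries_to_items and fetch_items are names made up for illustration, and the field mapping (pubDate read from 'updated', category read from the first entry in 'tags') is an assumption based on the output shown above. Newer feedparser releases are expected to expose pubDate as 'published', so the code tries that key first.

```python
def entries_to_items(parsed):
    """Build one flat dict per feed entry (hypothetical helper).

    `parsed` is the mapping returned by feedparser.parse(); any dict with
    an 'entries' list of the same shape works as well.
    """
    items = []
    for entry in parsed.get("entries", []):
        tags = entry.get("tags") or []
        items.append({
            "title": entry.get("title"),
            "link": entry.get("link"),
            # Assumption: 'published' on newer feedparser releases,
            # 'updated' on older ones (as in the output above).
            "pubdate": entry.get("published") or entry.get("updated"),
            # Assumption: the first tag's 'term' holds the <category> text.
            "category": tags[0]["term"] if tags else None,
        })
    return items


def fetch_items(url):
    """Parse the feed at `url` and return one dict per item."""
    import feedparser  # third-party package, imported only when needed

    return entries_to_items(feedparser.parse(url))
```

Calling fetch_items(url) with the feed URL should then return a list with a separate dict for each item. Note that 'bozo': 1 in the output above comes from the unescaped & in the link values, which is not well-formed XML; feedparser still returns the entries. If you would rather stay with XMLFeedSpider, setting itertag = 'item' and selecting relative paths such as 'title/text()' in parse_node should make Scrapy call parse_node once per item, though I have not run that against this feed.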
