Hi i am working on scrapy to scrape xml urls Suppose below is my

Question

0

Editorial Team

Asked: June 8, 20262026-06-08T08:06:00+00:00 2026-06-08T08:06:00+00:00

Hi i am working on scrapy to scrape xml urls Suppose below is my

0

Hi i am working on scrapy to scrape xml urls

Suppose below is my spider.py code

class TestSpider(BaseSpider):
    name = "test"
    allowed_domains = {"www.example.com"}


    start_urls = [
        "https://example.com/jobxml.asp"
        ]


    def parse(self, response):
        print response,"??????????????????????"

result:

2012-07-24 16:43:34+0530 [scrapy] INFO: Scrapy 0.14.3 started (bot: testproject)
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled item pipelines: 
2012-07-24 16:43:34+0530 [test] INFO: Spider opened
2012-07-24 16:43:34+0530 [test] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-07-24 16:43:36+0530 [testproject] DEBUG: Retrying <GET https://example.com/jobxml.asp> (failed 1 times): 400 Bad Request
2012-07-24 16:43:37+0530 [test] DEBUG: Retrying <GET https://example.com/jobxml.asp> (failed 2 times): 400 Bad Request
2012-07-24 16:43:38+0530 [test] DEBUG: Gave up retrying <GET https://example.com/jobxml.asp> (failed 3 times): 400 Bad Request
2012-07-24 16:43:38+0530 [test] DEBUG: Crawled (400) <GET https://example.com/jobxml.asp> (referer: None)
2012-07-24 16:43:38+0530 [test] INFO: Closing spider (finished)
2012-07-24 16:43:38+0530 [test] INFO: Dumping spider stats:
    {'downloader/request_bytes': 651,
     'downloader/request_count': 3,
     'downloader/request_method_count/GET': 3,
     'downloader/response_bytes': 504,
     'downloader/response_count': 3,
     'downloader/response_status_count/400': 3,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2012, 7, 24, 11, 13, 38, 573931),
     'scheduler/memory_enqueued': 3,
     'start_time': datetime.datetime(2012, 7, 24, 11, 13, 34, 803202)}
2012-07-24 16:43:38+0530 [test] INFO: Spider closed (finished)
2012-07-24 16:43:38+0530 [scrapy] INFO: Dumping global stats:
    {'memusage/max': 263143424, 'memusage/startup': 263143424}

Whether scrapy does n’t work for xml scraping, if yes can anyone please provide me an example on how to scrape xml tag data

Thanks in advance………..

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-08T08:06:01+00:00

You have a specific spider made for scraping xml feeds. This is from scrapy documentation:

XMLFeedSpider example

These spiders are pretty easy to use, let’s have a look at one example:

from scrapy import log
from scrapy.contrib.spiders import XMLFeedSpider
from myproject.items import TestItem

class MySpider(XMLFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']
    iterator = 'iternodes' # This is actually unnecesary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        log.msg('Hi, this is a <%s> node!: %s' % (self.itertag, ''.join(node.extract())))

        item = Item()
        item['id'] = node.select('@id').extract()
        item['name'] = node.select('name').extract()
        item['description'] = node.select('description').extract()
        return item

This is another way without scrapy:

This is a function used to download xml from given url, note that some import are not in here and this will also give you a nice progress for downloading xml file.

def get_file(self, dir, url, name):
    s = urllib2.urlopen(url)
    f = open('xml/test.xml','w')
    meta = s.info()
    file_size = int(meta.getheaders("Content-Length")[0])
    print "Downloading: %s Bytes: %s" % (name, file_size)
    current_file_size = 0
    block_size = 4096
    while True:
        buf = s.read(block_size)
        if not buf:
            break
        current_file_size += len(buf)
        f.write(buf)
        status = ("\r%10d  [%3.2f%%]" %
                 (current_file_size, current_file_size * 100. / file_size))
        status = status + chr(8)*(len(status)+1)
        sys.stdout.write(status)
        sys.stdout.flush()
    f.close()
    print "\nDone getting feed"
    return 1

And then you parse that xml file that you downloaded and saved with iterparse, something like:

for event, elem in iterparse('xml/test.xml'):
        if elem.tag == "properties":
            print elem.text

That’s just an example how do you go through xml tree.

Also, this is an old code of mine, so you would be better of using with for opening files.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

Hi i am working on scrapy to scrape xml urls Suppose below is my

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply