I have been working through the tutorial adapting it to a project I want


Asked: May 31, 2026 (2026-05-31T23:03:05+00:00)

I have been working through the tutorial, adapting it to a project I want to achieve. Something is going wrong and I just can’t find the error.

When using ‘scrapy shell’ I get the response I expect. For this site (NRL Ladder):

In [1]: hxs.select('//td').extract()
Out[1]: 
[u'<td>\r\n<div id="ls-nav">\r\n<ul><li><a href="http://www.nrlstats.com/"><span>Home</span></a></li>\r\n<li class="ls-nav-on"><a href="/nrl"><span>NRL</span></a></li>\r\n<li><a href="/nyc"><span>NYC</span></a></li>\r\n<li><a href="/rep"><span>Rep Matches</span></a></li>\r\n\r\n</ul></div>\r\n</td>',
 u'<td style="text-align:left" colspan="5">Round 4</td>',
 u'<td colspan="5">Updated: 26/3/2012</td>',
 u'<td style="text-align:left">1. Melbourne</td>',
 u'<td>4</td>',
 u'<td>4</td>',
 u'<td>0</td>',
 u'<td>0</td>',
 u'<td>0</td>',
 u'<td>122</td>',
 u'<td>39</td>',
 u'<td>83</td>',
 u'<td>8</td>',
 u'<td style="text-align:left">2. Canterbury-Bankstown</td>',

And on it goes.

I am really struggling to understand how to adapt the tutorial project to a different kind of data.

Is there any way to bring up help or a documentation list showing what types I should use in items when selecting ‘td’ or any other element? As I say, it works easily in the shell, but I cannot carry it over to the project files. Specifically, both the team names and the points are ‘td’ elements, but the team name is text.

Here is what I have done.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from nrl.items import NrlItem

class nrl(BaseSpider):
    name = "nrl"
    allowed_domains = ["http://live.nrlstats.com/"]
    start_urls = [
        "http://live.nrlstats.com/nrl/ladder.html",
        ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//td')
        items = []
        for site in sites:
           item = nrlItem()
           item['team'] = site.select('/text()').extract()
           item['points'] = site.select('/').extract()
           items.append(item)
        return items


1 Answer

Editorial Team · Answer 1 · 2026-05-31T23:03:06+00:00

I didn’t quite understand your question, but here is a starting point, imo (haven’t tested; see some comments in the code):

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from nrl.items import NrlItem

class nrl(BaseSpider):
    name = "nrl"
    allowed_domains = ["live.nrlstats.com"] # domains should be like this
    start_urls = [
        "http://live.nrlstats.com/nrl/ladder.html",
        ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//table[@class="tabler"]//tr[starts-with(@class, "r")]') # select team rows
        items = []
        for row in rows:
           item = NrlItem() # must match the imported class name
           columns = row.select('./td/text()').extract() # select columns for the selected row
           item['team'] = columns[0]
           item['P'] = int(columns[1])
           item['W'] = int(columns[2])
           ...
           items.append(item)
        return items
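
The row-by-row idea above can be tried without Scrapy installed. Here is a minimal sketch that uses only the standard library's xml.etree.ElementTree on a made-up, well-formed fragment; the markup, the class names (head, r1, r2) and the numbers are assumptions loosely modelled on the shell output in the question, not the real page. ElementTree's XPath subset has no starts-with(), so that check is done in Python instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the ladder table (not the real page).
SAMPLE = """
<table class="tabler">
  <tr class="head"><td>Team</td><td>P</td><td>W</td></tr>
  <tr class="r1"><td>1. Melbourne</td><td>4</td><td>4</td></tr>
  <tr class="r2"><td>2. Canterbury-Bankstown</td><td>4</td><td>3</td></tr>
</table>
"""

def parse_ladder(markup):
    """Return one dict per team row, converting numeric cells to int."""
    root = ET.fromstring(markup)
    items = []
    for row in root.iter("tr"):
        # Keep only rows whose class starts with "r" (the team rows here).
        if not row.get("class", "").startswith("r"):
            continue
        # One text value per <td> in this row only.
        columns = [(td.text or "").strip() for td in row.findall("td")]
        items.append({
            "team": columns[0],      # text stays a string
            "P": int(columns[1]),    # numeric cells are converted explicitly
            "W": int(columns[2]),
        })
    return items

ladder = parse_ladder(SAMPLE)
for entry in ladder:
    print(entry)
```

The point of the sketch is that the cells are grouped per row first, so the team name and its numbers end up in the same item rather than each td becoming its own item.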

UPDATE:

//table[@class="tabler"]//tr[starts-with(@class, "r")] is an xpath query. See some xpath examples here.

hxs.select(xpath_query) always returns a list of nodes (also of type HtmlXPathSelector) which fall under the given query.

hxs.extract() returns string representation of the node(s).
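
The difference between a list of nodes and their string form can be seen in a small standalone sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors (an assumption made so it runs without Scrapy); findall plays roughly the role of select, and ET.tostring roughly the role of extract on a single node. The one-row fragment is made up from the shell output in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical single row, modelled on the shell output in the question.
row = ET.fromstring(
    '<tr><td style="text-align:left">1. Melbourne</td><td>4</td></tr>'
)

# Querying gives back a list of nodes, not strings.
cells = row.findall("./td")
first = cells[0]

# String representation of the whole node, tags included.
as_markup = ET.tostring(first, encoding="unicode")

# Only the text inside the node, which is usually what goes into an item.
as_text = first.text

print(type(cells).__name__, len(cells))
print(as_markup)
print(as_text)
```

This is why the team cell and the points cells can both be td elements: the node type is the same, and it is the extracted text that is kept as a string or converted with int().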

P.S. Beware that scrapy supports XPath 1.0, but not 2.0 (at least on Linux, not sure about Windows), so some of the newest xpath features might not work.
