EDIT: THIS HAS BEEN RESOLVED! XPATH WAS THE ISSUE. I’m very confused. I’m trying

Question

Asked: June 17, 2026 (2026-06-17T04:30:11+00:00)

EDIT: THIS HAS BEEN RESOLVED! XPATH WAS THE ISSUE.

I’m very confused. I’m trying to write a very simple spider to crawl a website (talkbass.com) and get a list of all the links in the classified bass section (http://www.talkbass.com/forum/f126/). I based the spider on the tutorial (which I completed with relative ease), but this one is just not working. I am probably doing a lot wrong, as I also tried to incorporate a Rule, but I’m getting nothing back.

My item code is:

from scrapy.item import Item, Field
class BassItem(Item):
    title = Field()
    link = Field()
    print title, link

My spider code is:

from scrapy.spider import BaseSpider 
from scrapy.contrib.spiders import Rule 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
from scrapy.selector import HtmlXPathSelector
from tutorial.items import BassItem
class BassSpider(BaseSpider):
    name = "bass"
    allowed_domains = ["www.talkbass.com"]
    start_urls = ["http://www.talkbass.com/forum/f126/"]

    rules = (
        # Extract links matching 'f126/xxx'
        # and follow links from them (since no callback means follow=True by default).
        Rule(SgmlLinkExtractor(allow=('/f126/(\d*)/', ), ))
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        ads = hxs.select('/html/body/div/div/div/table/tbody/tr/td/form/table/tbody')
        items = []
        for ad in ads:
            item = BassItem()
            item['title'] = ad.select('a/text()').extract()
            item['link'] = ad.select('a/@href').extract()
            items.append(item)
        return items

I don’t get any errors, but the log just doesn’t return anything. Here’s what I see in the console:

C:\Python27\Scrapy\tutorial>scrapy crawl bass
2013-01-07 14:36:49+0800 [scrapy] INFO: Scrapy 0.16.3 started (bot: tutorial)
2013-01-07 14:36:49+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
{} {}
2013-01-07 14:36:51+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-01-07 14:36:51+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-01-07 14:36:51+0800 [scrapy] DEBUG: Enabled item pipelines:
2013-01-07 14:36:51+0800 [bass] INFO: Spider opened
2013-01-07 14:36:51+0800 [bass] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-01-07 14:36:51+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-01-07 14:36:51+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-01-07 14:36:52+0800 [bass] DEBUG: Crawled (200) <GET http://www.talkbass.com/forum/f126/> (referer: None)
2013-01-07 14:36:52+0800 [bass] INFO: Closing spider (finished)
2013-01-07 14:36:52+0800 [bass] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 233,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'downloader/response_bytes': 17997,
         'downloader/response_count': 1,
         'downloader/response_status_count/200': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2013, 1, 7, 6, 36, 52, 305000),
         'log_count/DEBUG': 7,
         'log_count/INFO': 4,
         'response_received_count': 1,
         'scheduler/dequeued': 1,
         'scheduler/dequeued/memory': 1,
         'scheduler/enqueued': 1,
         'scheduler/enqueued/memory': 1,
         'start_time': datetime.datetime(2013, 1, 7, 6, 36, 51, 458000)}
2013-01-07 14:36:52+0800 [bass] INFO: Spider closed (finished)

I’ve never done an actual project, and I thought this would be a good place to start, but I can’t seem to straighten this out.

I’m also not sure if the XPath is correct. I’m using a Chrome extension called XPath Helper. An example of one of the elements I need is this:

/html/body/div[1]/div[@class='page']/div/table[5]/tbody/tr/td[2]/form[@id='inlinemodform']/table[@id='threadslist']/tbody[@id='threadbits_forum_126']/tr[6]/td[@id='td_threadtitle_944468']/div[1]/a[@id='thread_title_944468']

However, the "tr[6]" and the "944468" are not constant across links (everything else is). I just removed the class names and the numbers, which left me with what you see in my spider code.

Also, just to add: when I copy and paste the XPath directly from XPath Helper, it gives a syntax error:

    ads = hxs.select('/html/body/div[1]/div[@class='page']/div/table[5]/tbody/tr/td[2]/form[@id='inlinemodform']/table[@id='threadslist']/tbody[@id='threadbits_forum_126']/tr[6]/td[@id='td_threadtitle_944468']/div[1]/a[@id='thread_title_944468']')
                                                                                                                                                                                       ^
SyntaxError: invalid syntax

I have tried messing around with that (using wildcards where the elements are not constant) and have received a syntax error each time I try.

1 Answer

Editorial Team · Answer 1 · 2026-06-17T04:30:13+00:00

One likely cause is that you are subclassing BaseSpider, which does not implement rules, so your rules attribute is never used.

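Try changing BaseSpider to CrawlSpider (it can be imported from scrapy.contrib.spiders alongside Rule). You should also rename parse to something like parse_item, since CrawlSpider implements its own parse method, which means you need to set a callback explicitly in your rule. Also note the trailing comma after the Rule, which is what makes rules a tuple. E.g.: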

rules = (
    # Extract links matching 'index<N>.html' and pass each response to parse_item.
    # Once a callback is set, follow defaults to False; add follow=True to keep following links.
    Rule(SgmlLinkExtractor(allow=(r'index\d+\.html', )), callback='parse_item'),
)
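
To see why the allow pattern matters, here is a small sketch that uses only Python’s re module. It assumes, as the Scrapy link extractor documentation describes, that each allow entry is a regular expression the absolute URL has to match. The two URLs after the start URL are hypothetical examples of what forum links might look like, not links taken from the site.

```python
import re

# The first URL is the start URL from the question.
# The other two are HYPOTHETICAL link shapes, used only for illustration.
urls = [
    "http://www.talkbass.com/forum/f126/",
    "http://www.talkbass.com/forum/f126/index2.html",
    "http://www.talkbass.com/forum/f126/example-thread-944468/",
]

# Pattern from the question: "/f126/", optional digits, then another "/".
question_pattern = re.compile(r"/f126/(\d*)/")
# Pattern from this answer: pagination pages such as "index2.html".
answer_pattern = re.compile(r"index\d+\.html")


def allowed(pattern, candidates):
    """Return the candidates the pattern matches anywhere in the URL."""
    return [url for url in candidates if pattern.search(url)]


question_matches = allowed(question_pattern, urls)
answer_matches = allowed(answer_pattern, urls)

print(question_matches)
print(answer_matches)
```

Under these assumptions, the question’s pattern only accepts a purely numeric path segment after /f126/, so it would skip links shaped like the hypothetical ones above, while the answer’s pattern picks up the pagination page.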

An updated XPath to try is as follows. Note that this will include all of the sticky threads, so it’s left as an exercise to the OP to work out how to filter those out.

ads = hxs.select("//td[substring(@id, 1, 15) = 'td_threadtitle_']/div/a")
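
Two things about that line may be worth spelling out. The substring test keeps every td whose id begins with the 15-character prefix 'td_threadtitle_', whatever thread number follows. And the expression is wrapped in double quotes so the single-quoted literal inside it does not end the Python string early, which is what seems to trigger the SyntaxError in the question. The sketch below illustrates both with the standard library only. It is not Scrapy code: the markup is a hypothetical, simplified stand-in for the thread list, and the prefix check is done with str.startswith instead of XPath because xml.etree does not support substring().

```python
import xml.etree.ElementTree as ET

# HYPOTHETICAL, simplified markup standing in for the forum's thread list.
SAMPLE = """
<table id="threadslist">
  <tr>
    <td id="td_threadtitle_944468"><div><a href="/forum/f126/example-a-944468/">Example bass A</a></div></td>
    <td id="td_other_1"><div><a href="/forum/members/1/">someone</a></div></td>
  </tr>
  <tr>
    <td id="td_threadtitle_944470"><div><a href="/forum/f126/example-b-944470/">Example bass B</a></div></td>
  </tr>
</table>
"""

PREFIX = "td_threadtitle_"  # 15 characters, hence substring(@id, 1, 15)


def extract_ads(markup):
    """Collect (title, link) pairs from td elements whose id starts with PREFIX."""
    root = ET.fromstring(markup)
    found = []
    for td in root.iter("td"):
        if td.get("id", "").startswith(PREFIX):
            # Same relative step as the "/div/a" tail of the XPath.
            for anchor in td.findall("div/a"):
                found.append((anchor.text, anchor.get("href")))
    return found


ads = extract_ads(SAMPLE)
print(ads)

# Quoting: pasted with single quotes outside, the inner quotes end the string early.
pasted = "hxs.select('//td[@id='td_threadtitle_944468']')"
try:
    compile(pasted, "<pasted>", "exec")
    pasted_compiles = True
except SyntaxError:
    pasted_compiles = False

# With double quotes outside, the same XPath is an ordinary Python string.
quoted = "//td[@id='td_threadtitle_944468']"
print(pasted_compiles, quoted)
```

The same quoting fix should apply to the full path copied from XPath Helper: delimit the Python string with double quotes, or escape the inner single quotes.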
