This is the code for Spyder1 that I’ve been trying to write within Scrapy

Question

Asked: May 13, 2026

This is the code for Spider1 that I’ve been trying to write within the Scrapy framework:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from firm.items import FirmItem

class Spider1(CrawlSpider):
    domain_name = 'wc2'
    start_urls = ['http://www.whitecase.com/Attorneys/List.aspx?LastName=A']
    rules = (
        Rule(SgmlLinkExtractor(allow=["hxs.select(
            '//td[@class='altRow'][1]/a/@href').re('/.a\w+')"]), 
            callback='parse'),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        JD = FirmItem()
        JD['school'] = hxs.select(
                   '//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)'
        )
        return JD    

SPIDER = Spider1()

The regex in the rules successfully pulls all the bio URLs that I want from the start URL:

>>> hxs.select(
...             '//td[@class="altRow"][1]/a/@href').re('/.a\w+')
[u'/cabel', u'/jacevedo', u'/jacuna', u'/aadler', u'/zahmedani', u'/tairisto', u
'/zalbert', u'/salberts', u'/aaleksandrova', u'/malhadeff', u'/nalivojvodic', u'
/kallchurch', u'/jalleyne', u'/lalonzo', u'/malthoff', u'/valvarez', u'/camon',
u'/randerson', u'/eandreeva', u'/pangeli', u'/jangland', u'/mantczak', u'/darany
i', u'/carhold', u'/marora', u'/garrington', u'/jartzinger', u'/sasayama', u'/ma
sschenfeldt', u'/dattanasio', u'/watterbury', u'/jaudrlicka', u'/caverch', u'/fa
yanruoh', u'/razar']
>>>

But when I run the spider, I get:

[wc2] ERROR: Error processing FirmItem(school=[]) - 

[Failure instance: Traceback: <type 'exceptions.IndexError'>: list index out of range

This is the FirmItem in Items.py

from scrapy.item import Item, Field

class FirmItem(Item):
    school = Field()

    pass

Can you help me understand where the index error occurs?

It seems to me that it has something to do with SgmlLinkExtractor.

I’ve been trying to make this spider work with Scrapy for weeks. They have an excellent tutorial, but I am new to Python and web programming, so I don’t understand how, for instance, SgmlLinkExtractor works behind the scenes.

Would it be easier for me to try to write a spider with the same simple functionality with Python libraries? I would appreciate any comments and help.

Thanks

1 Answer

Editorial Team · Answer 1 · 2026-05-13T00:43:08+00:00

SgmlLinkExtractor doesn’t support selectors in its “allow” argument. It expects regular expressions, which are matched against the URL of each link it finds.

So this is wrong:

SgmlLinkExtractor(allow=["hxs.select('//td[@class='altRow'] ...')"])

This is right:

SgmlLinkExtractor(allow=[r"product\.php"])
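To see what passing a regular expression to “allow” amounts to, here is a minimal sketch that uses only Python’s re module rather than Scrapy itself. The URLs are hypothetical, built from the hrefs in the question, and filter_links is an illustrative helper, not part of Scrapy’s API. It only mimics the idea of keeping links whose URL matches one of the patterns.

```python
import re

# Hypothetical absolute URLs, built from the hrefs shown in the question.
candidate_urls = [
    "http://www.whitecase.com/cabel",
    "http://www.whitecase.com/jacevedo",
    "http://www.whitecase.com/Attorneys/List.aspx?LastName=B",
]


def filter_links(urls, allow):
    """Keep URLs that match at least one regex in `allow`.

    Illustrative helper only (an assumption, not Scrapy code).
    """
    compiled = [re.compile(pattern) for pattern in allow]
    return [url for url in urls if any(rx.search(url) for rx in compiled)]


# The same pattern the question passed to .re(), used directly as a URL filter.
bio_urls = filter_links(candidate_urls, allow=[r"/.a\w+"])
print(bio_urls)
```

Under these assumptions, the pattern from the question could probably be passed to “allow” on its own, without the hxs.select(...) wrapper, since the link extractor already collects the hrefs from the page.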
