I have a Scrapy Crawler that crawls some guides from a forum. The forum

Question

Asked: June 11, 2026

I have a Scrapy Crawler that crawls some guides from a forum.
The forum I’m trying to crawl spans a number of pages.
The problem is that I cannot extract the links I want, because they have no specific classes or ids to select on.
The URL structure looks like this: http://www.guides.com/forums/forumdisplay.php?f=108&order=desc&page=1
Obviously I can change the number after desc&page= from 1 to 2, 3, 4 and so on, but I would like to know the best way to do this.
How can I accomplish that?

PS: This is the spider code
http://dpaste.com/hold/794794/


1 Answer

Editorial Team · Answer 1 · 2026-06-11T00:58:51+00:00

I can’t seem to open the forum URL (it always redirects me to another website), so here’s a best-effort suggestion:

If there are links to the other pages on the thread page, you can create a crawler rule to explicitly follow these links. Use a CrawlSpider for that:

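from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class GuideSpider(CrawlSpider):
    name = "Guide"
    allowed_domains = ['www.guides.com']
    start_urls = [
        "http://www.guides.com/forums/forumdisplay.php?f=108&order=desc&page=1",
    ]

    # allow/deny belong to the link extractor; callback and follow belong to Rule.
    # LinkExtractor is the current replacement for the deprecated SgmlLinkExtractor.
    rules = [
        Rule(
            LinkExtractor(allow=(r"forumdisplay\.php.*f=108.*page=",)),
            callback='parse_item',
            follow=True,
        ),
    ]

    def parse_item(self, response):
        # Your code
        ...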

The spider should automatically deduplicate requests, i.e. it won’t follow the same URL twice even if two pages link to it. If there are very similar URLs on the page that differ in only one or two query arguments (say, order=asc), you can pass deny=(...) to the link extractor constructor (not to Rule itself) to filter them out.
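
To make the filtering described above concrete, here is a minimal standard-library sketch of how allow and deny patterns combined with deduplication would decide which links get followed. It only imitates the behaviour and is not Scrapy's implementation: the order=asc deny pattern, the should_follow helper, and the sample URLs are assumptions for illustration, and Scrapy deduplicates on request fingerprints rather than on raw URL strings.

```python
import re

# Allow pattern taken from the rule above; the deny pattern is an assumed example.
ALLOW = re.compile(r"forumdisplay\.php.*f=108.*page=")
DENY = re.compile(r"order=asc")


def should_follow(url, seen):
    """Return True if the URL matches ALLOW, avoids DENY, and has not been seen."""
    if not ALLOW.search(url) or DENY.search(url):
        return False
    if url in seen:
        return False
    seen.add(url)
    return True


base = "http://www.guides.com/forums/forumdisplay.php?f=108"
candidates = [
    base + "&order=desc&page=2",
    base + "&order=asc&page=2",   # rejected by the deny pattern
    base + "&order=desc&page=2",  # duplicate of the first URL
    "http://www.guides.com/forums/showthread.php?t=1",  # does not match allow
]

seen = set()
followed = [url for url in candidates if should_follow(url, seen)]
print(followed)
```

Only the first candidate survives: the second is denied, the third is a duplicate, and the fourth never matches the allow pattern. In a real spider the same two patterns would go into the link extractor's allow and deny arguments.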
