I have a Scrapy Crawler that crawls some guides from a forum.
The forum I'm trying to crawl has a number of pages.
The problem is that I cannot extract the links I want, because there are no specific classes or ids to select on.
The url structure is like this one: http://www.guides.com/forums/forumdisplay.php?f=108&order=desc&page=1
Obviously I can change the number after desc&page= to 2, 3, 4 and so on, but I would like to know the best way to do this.
How can I accomplish that?
PS: This is the spider code
http://dpaste.com/hold/794794/
I can't seem to open the forum URL (it always redirects me to another website), so here's a best-effort suggestion:
If there are links to the other pages on the thread page, you can create a crawler rule to explicitly follow these links. Use a CrawlSpider for that:
The spider should automatically deduplicate requests, i.e. it won't follow the same URL twice even if two pages link to it. If there are very similar URLs on the page that differ in only one or two query arguments (say, order=asc), you can specify deny=(...) on the link extractor that you pass to the Rule constructor to filter them out.