My first question here :) I was trying to crawl my schools website for


Asked: June 15, 2026


My first question here 🙂

I was trying to crawl my school's website to collect every webpage on it, but I cannot get the links into a text file. I have the right permissions, so that is not the problem.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from scrapy.spider import BaseSpider

class hsleidenSpider(CrawlSpider):
    name = "hsleiden1"
    allowed_domains = ["hsleiden.nl"]
    start_urls = ["http://hsleiden.nl"]

    # allow=() is used to match all links
    rules = [
        Rule(SgmlLinkExtractor(allow=()), follow=True),
        Rule(SgmlLinkExtractor(allow=()), callback='parse_item')
    ]

    def parse_item(self, response):
        x = HtmlXPathSelector(response)

        filename = "hsleiden-output.txt"
        open(filename, 'ab').write(response.url)

So I am only crawling pages on hsleiden.nl, and I would like to write each response.url to the text file hsleiden-output.txt.

Is there any way to do this right?


1 Answer

Editorial Team · Answer 1 · 2026-06-15T13:14:24+00:00

According to the documentation for CrawlSpider, if multiple rules match the same link, only the first one (in the order the rules are defined) is used.

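Both of your rules use allow=(), so both match every link and the first rule always wins. Because of redirects on the site, following links with that first rule results in a seemingly infinite loop. And since the second rule is ignored, none of the matching links are ever passed to the parse_item callback, which means no output file.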
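To make the "first matching rule wins" behaviour concrete, here is a toy model in plain Python. It is not Scrapy's implementation; `RULES`, `pick_rule` and `callback_for` are hypothetical names made up for this illustration, and each rule is reduced to a URL pattern plus an optional callback name.

```python
import re

# Toy stand-ins for the two rules in the question. Both patterns match
# every URL, mirroring allow=() in the original spider.
RULES = [
    {"pattern": re.compile(r".*"), "callback": None},          # follow-only rule
    {"pattern": re.compile(r".*"), "callback": "parse_item"},  # callback rule
]


def pick_rule(url, rules):
    """Return the index of the first rule whose pattern matches url, or None."""
    for index, rule in enumerate(rules):
        if rule["pattern"].match(url):
            return index
    return None


def callback_for(url, rules):
    """Return the callback name of the winning rule (None if it has none)."""
    index = pick_rule(url, rules)
    return None if index is None else rules[index]["callback"]
```

With both rules present, `callback_for` always returns `None` because the follow-only rule is listed first; with that rule removed (`RULES[1:]`), the same URL resolves to `"parse_item"`.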

Some investigation is required to fix the redirect issue (and to modify the first rule so that it doesn’t clash with the second), but commenting out the first rule entirely will produce an output file of links like so:

http://www.hsleiden.nl/activiteitenkalenderhttp://www.hsleiden.nlhttp://www.hsleiden.nl/vind-je-studie/proefstuderenhttp://www.hsleiden.nl/studiumgenerale

etc

They were all munged together on a single line, so you might want to add a newline character or separator each time you write to the output file.
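A minimal sketch of the newline fix, assuming Python 3 and a plain-text output file. `append_url` is a hypothetical helper name, not part of Scrapy; the idea is simply to open the file in text append mode and write one URL per line.

```python
def append_url(filename, url):
    """Append url to filename on its own line."""
    # Text mode ("a") rather than binary ("ab"): in Python 3, writing a str
    # to a file opened in binary mode raises TypeError.
    # The with-block closes the file after each write.
    with open(filename, "a", encoding="utf-8") as handle:
        handle.write(url + "\n")
```

Inside parse_item this would be called as `append_url("hsleiden-output.txt", response.url)` in place of the bare `open(...).write(...)` line.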
