I started with scrapy some days ago, learned about scraping particular sites, ie the


Asked: May 26, 2026 (2026-05-26T16:01:38+00:00)


I started with Scrapy some days ago and learned about scraping particular sites, i.e. the dmoz.org example; so far it’s fine and I like it. As I want to learn about search engine development, I aim to build a crawler (and storage, indexer, etc.) for a large number of websites of any “color” and content.

So far I have also tried depth-first and breadth-first crawling.

At the moment I use just one Rule, in which I set some paths and some domains to skip.

Rule(SgmlLinkExtractor(deny=path_deny_base, deny_domains=deny_domains),
        callback='save_page', follow=True),

I have one pipeline, a MySQL storage that stores the URL, body and headers of the downloaded pages, done via a PageItem with these fields.

My questions for now are:

Is it fine to use an Item simply for storing pages?
How can the spider check the database to see whether a page has already been crawled (in the last six months, for example)? Is that built in somehow?
Is there something like a blacklist for useless domains, i.e. placeholder domains, link farms, etc.?

There are many other issues, like storage, but I guess I'll stop here, with just one more general search engine question:

Is there a way to obtain crawl result data from other professional crawlers? Of course it would have to be done by sending hard disks; otherwise the data volume would be the same as if I crawled the sites myself (compression left aside).


1 Answer

Editorial Team · Answer 1 · 2026-05-26T16:01:38+00:00

I will try to answer only two of your questions:

Is it fine to use an Item simply for storing pages?

AFAIK, Scrapy doesn’t care what you put into an Item’s Field. Only your pipeline will be dealing with them.
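As a minimal sketch of that division of labour: Scrapy item pipelines are ordinary classes with a `process_item(item, spider)` method, so the spider only fills in the fields and the pipeline decides what happens to them. The sketch below makes several assumptions that are not in the question: a plain dict stands in for the PageItem, SQLite stands in for the MySQL storage so that it runs without a server, and the class name `PageStoragePipeline` and the `pages` table layout are made up for illustration.

```python
import json
import sqlite3


class PageStoragePipeline:
    """Hypothetical pipeline: stores url, body and headers of each page."""

    def __init__(self, db_path=":memory:"):
        # SQLite is an assumption made so the sketch is self-contained;
        # the setup in the question uses MySQL instead.
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            "url TEXT PRIMARY KEY, body TEXT, headers TEXT)"
        )

    def process_item(self, item, spider):
        # Called once per scraped item. The framework does not inspect
        # the field values, so whole pages can be stored here.
        self.conn.execute(
            "INSERT OR REPLACE INTO pages (url, body, headers) VALUES (?, ?, ?)",
            (item["url"], item["body"], json.dumps(item["headers"])),
        )
        self.conn.commit()
        return item  # pass the item on to any later pipeline


# Stand-alone usage, outside a running crawl (spider is unused here).
pipeline = PageStoragePipeline()
page = {
    "url": "http://example.com/",
    "body": "<html></html>",
    "headers": {"Content-Type": "text/html"},
}
pipeline.process_item(page, spider=None)
stored = pipeline.conn.execute("SELECT url, headers FROM pages").fetchall()
```

In a real project the pipeline would be enabled through the `ITEM_PIPELINES` setting rather than called by hand.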

How can the spider check the database to see whether a page has already been crawled (in the last six months, for example)? Is that built in somehow?

Scrapy has a duplicates middleware, but it filters duplicates only within the current session. You have to manually prevent Scrapy from re-crawling pages you have already crawled in the last six months.
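One way to do that manual check, sketched under assumptions: keep a table of URLs with the time each was last fetched, and consult it before scheduling a request. The table name `crawl_log`, the helpers `open_log`, `mark_crawled` and `should_crawl`, and the 180-day window are all illustrative, and SQLite again stands in for the MySQL storage. How the check is wired into the crawl (for example from a downloader middleware) is not shown.

```python
import sqlite3
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=180)  # assumed stand-in for "six months"


def open_log(db_path=":memory:"):
    """Open (and create if needed) the hypothetical crawl_log table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS crawl_log ("
        "url TEXT PRIMARY KEY, fetched_at TEXT)"
    )
    return conn


def mark_crawled(conn, url, when):
    """Record that url was fetched at the given datetime."""
    conn.execute(
        "INSERT OR REPLACE INTO crawl_log (url, fetched_at) VALUES (?, ?)",
        (url, when.isoformat()),
    )
    conn.commit()


def should_crawl(conn, url, now):
    """Return True if url was never fetched or its last fetch is too old."""
    row = conn.execute(
        "SELECT fetched_at FROM crawl_log WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        return True
    return now - datetime.fromisoformat(row[0]) > MAX_AGE


conn = open_log()
now = datetime(2026, 5, 26)
mark_crawled(conn, "http://example.com/recent", now - timedelta(days=30))
mark_crawled(conn, "http://example.com/stale", now - timedelta(days=200))
```

With this data, `should_crawl` skips the page fetched 30 days ago and allows both the page fetched 200 days ago and any URL that is not in the table.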

As for questions 3 and 4, I don’t understand them.
