I started with Scrapy a few days ago and learned how to scrape particular sites, e.g. the dmoz.org example; so far it’s fine and I like it. As I want to learn about search engine development, I aim to build a crawler (plus storage, indexer, etc.) for a large number of websites of any “color” and content.
So far I have also tried depth-first and breadth-first crawling.
At the moment I use just one Rule, in which I set some paths and some domains to skip:
    Rule(SgmlLinkExtractor(deny=path_deny_base, deny_domains=deny_domains),
         callback='save_page', follow=True),
I have one pipeline, a MySQL storage that saves the URL, body and headers of each downloaded page, passed along via a PageItem with these fields.
My questions for now are:
- Is it fine to use an item simply for storing pages?
- How does the spider check the database to see whether a page has already been crawled (e.g. in the last six months)? Is that built in somehow?
- Is there something like a blacklist for useless domains, e.g. placeholder domains, link farms, etc.?
There are many other issues, like storage, but I guess I’ll stop here, with just one more general search engine question:
- Is there a way to obtain crawl result data from other professional crawlers? Of course it would have to be done by shipping hard disks, otherwise the data volume to transfer would be the same as if I crawled the sites myself (leaving compression aside).
I will try to answer only two of your questions:
AFAIK, Scrapy doesn’t care what you put into an Item’s Field. Only your pipeline will be dealing with them.
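To illustrate the point about pipelines, here is a rough sketch of a storage pipeline for such page items. It is not your actual setup: I use sqlite3 instead of MySQL so the snippet runs on its own, and the class name PageStoragePipeline, the table layout, and the assumption that headers arrive as a plain dict of strings are all made up for the example.

```python
import json
import sqlite3


class PageStoragePipeline:
    """Hypothetical pipeline: stores url, headers and body of each page item."""

    def __init__(self, db_path=":memory:"):
        # db_path is an assumption; ":memory:" keeps the demo self-contained.
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the DB and create the table.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            "url TEXT PRIMARY KEY, headers TEXT, body BLOB)"
        )

    def close_spider(self, spider):
        # Called once when the spider finishes: flush and close.
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Works for a dict or any item object that supports item["field"].
        # Headers are assumed to be a plain dict of strings already.
        self.conn.execute(
            "INSERT OR REPLACE INTO pages (url, headers, body) VALUES (?, ?, ?)",
            (item["url"], json.dumps(item["headers"]), item["body"]),
        )
        return item  # hand the item on to any later pipeline
```

The pipeline only reads the three fields by name, which is why it does not matter to the framework what the item holds.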
Scrapy has a duplicates middleware, but it filters duplicates only within the current session. You have to take care yourself that Scrapy does not re-crawl pages you have already crawled in the last six months.
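One way to do that manual check is to keep a small table of fetch timestamps and ask it before each request. The following is only a sketch under my own assumptions: the CrawlHistory class, its method names and the sqlite3 backend are hypothetical (you would query your MySQL table instead), and six months is approximated as 182 days.

```python
import sqlite3
import time

SIX_MONTHS = 182 * 24 * 60 * 60  # seconds, an approximation of six months


class CrawlHistory:
    """Hypothetical helper: remembers when each URL was last fetched."""

    def __init__(self, db_path=":memory:", max_age=SIX_MONTHS):
        self.max_age = max_age
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS crawled ("
            "url TEXT PRIMARY KEY, fetched_at REAL)"
        )

    def mark_crawled(self, url, now=None):
        # Record (or refresh) the fetch time for this URL.
        now = time.time() if now is None else now
        self.conn.execute(
            "INSERT OR REPLACE INTO crawled (url, fetched_at) VALUES (?, ?)",
            (url, now),
        )
        self.conn.commit()

    def should_crawl(self, url, now=None):
        # Crawl if the URL was never seen or its last fetch is older than max_age.
        now = time.time() if now is None else now
        row = self.conn.execute(
            "SELECT fetched_at FROM crawled WHERE url = ?", (url,)
        ).fetchone()
        return row is None or now - row[0] > self.max_age
```

In Scrapy this check could sit in a downloader middleware: call should_crawl(request.url) in process_request and raise scrapy.exceptions.IgnoreRequest when it returns False, then call mark_crawled from your storage pipeline. I have not tested that wiring, so treat it as a starting point.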
As for questions 3 and 4, I don’t understand them.