The Archive Base

Editorial Team
Asked: May 26, 2026

I started with Scrapy a few days ago and learned how to scrape particular sites, e.g. the dmoz.org example; so far it’s fine and I like it. Because I want to learn about search engine development, I aim to build a crawler (plus storage, indexer, etc.) for a large number of websites of any “color” and content.

So far I have also tried depth-first-order and breadth-first-order crawling.
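To make the difference between the two crawl orders concrete, here is a minimal sketch on a toy link graph. The graph `LINKS` and the function `crawl_order` are hypothetical names used only for illustration; Scrapy itself controls the order through its scheduler queue settings rather than through code like this.

```python
from collections import deque

# Hypothetical toy link graph: page -> links found on that page.
LINKS = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["e"],
    "d": [],
    "e": [],
}


def crawl_order(start, breadth_first=True):
    """Return the order in which pages are visited from `start`."""
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        # FIFO pop gives breadth-first order, LIFO pop gives depth-first order.
        page = frontier.popleft() if breadth_first else frontier.pop()
        order.append(page)
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order
```

The only difference between the two strategies is which end of the frontier the next page is taken from, which is why a crawler can usually switch between them by swapping its queue type.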

At the moment I use just one Rule, in which I set some paths and some domains to skip.

Rule(SgmlLinkExtractor(deny=path_deny_base, deny_domains=deny_domains),
        callback='save_page', follow=True),
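As a rough illustration of what the `deny` and `deny_domains` arguments in such a rule are meant to achieve, the filtering can be sketched in plain Python. The values of `path_deny_base` and `deny_domains` below are hypothetical (the question does not show them), and `is_allowed` is an illustrative helper that only approximates the link extractor's behaviour.

```python
import re
from urllib.parse import urlparse

# Hypothetical values; the question does not show the real lists.
path_deny_base = [r"/login", r"/tag/"]
deny_domains = ["example-linkfarm.com"]


def is_allowed(url):
    """Return True if the URL passes both the domain and the path filter."""
    host = urlparse(url).hostname or ""
    # A denied domain also blocks its subdomains.
    for domain in deny_domains:
        if host == domain or host.endswith("." + domain):
            return False
    # Any deny pattern found anywhere in the URL rejects it.
    return not any(re.search(pattern, url) for pattern in path_deny_base)
```

For example, `is_allowed("http://example.org/page")` is true, while `is_allowed("http://www.example-linkfarm.com/")` and `is_allowed("http://example.org/login")` are both false.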

I have one pipeline, a MySQL storage that stores the URL, body and headers of the downloaded pages, done via a PageItem with these fields.
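A storage pipeline of this kind might look roughly like the sketch below. It is not the pipeline from the question: `sqlite3` from the standard library stands in for MySQL so the example is self-contained, the class name `PageStoragePipeline` and the table layout are assumptions, and only the method names (`open_spider`, `process_item`, `close_spider`) follow Scrapy's item pipeline interface.

```python
import sqlite3


class PageStoragePipeline:
    """Pipeline-shaped class; sqlite3 stands in for the MySQL storage."""

    def open_spider(self, spider):
        # In-memory database for the sketch; a real crawl would use a server.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE pages (url TEXT PRIMARY KEY, headers TEXT, body TEXT)"
        )

    def process_item(self, item, spider):
        # Parameterized query; storing the same URL again overwrites the row.
        self.db.execute(
            "INSERT OR REPLACE INTO pages (url, headers, body) VALUES (?, ?, ?)",
            (item["url"], item["headers"], item["body"]),
        )
        self.db.commit()
        return item

    def close_spider(self, spider):
        self.db.close()
```

Because the item is accessed like a dictionary, the same `process_item` works with a plain `dict` as well as with an item class that has `url`, `headers` and `body` fields.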

My questions for now are:

  1. Is it fine to use an Item simply for storing pages?

  2. How can the spider check the database to see whether a page has already been crawled (e.g. within the last six months)? Is this built in somehow?

  3. Is there something like a blacklist for useless domains, e.g. placeholder domains, link farms, etc.?

There are many other issues, such as storage, but I guess I'll stop here, with just one more general search engine question:

  • Is there a way to obtain crawl result data from other professional crawlers? Of course it would have to be done by sending hard disks, otherwise the data volume would be the same as if I crawled the sites myself (compression left aside).

1 Answer

  1. Editorial Team
    Added an answer on May 26, 2026 at 4:01 pm

    I will try to answer only two of your questions:

    1. Is it fine to use an Item simply for storing pages?

    AFAIK, Scrapy doesn’t care what you put into an Item’s Field. Only your pipeline will be dealing with them.

    2. How can the spider check the database to see whether a page has already been crawled (e.g. within the last six months)? Is this built in somehow?

    Scrapy has a duplicates middleware, but it filters duplicates only within the current session. You have to prevent Scrapy yourself from re-crawling pages that you already crawled within the last six months.
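A manual check of this kind could be sketched as follows. This is only an illustration under stated assumptions: `should_crawl`, `RECRAWL_AFTER` and the `last_crawled` mapping are hypothetical names, and in a real setup the lookup would go against the database from the question rather than a dictionary. One plausible place to call such a check is a downloader middleware that drops the request when the check returns false.

```python
from datetime import datetime, timedelta

# Roughly six months; the exact threshold is an assumption.
RECRAWL_AFTER = timedelta(days=182)


def should_crawl(url, last_crawled, now):
    """Return True if the URL was never crawled or its last crawl is too old.

    `last_crawled` maps url -> datetime of the last download; in practice
    this would be a query against the stored pages.
    """
    previous = last_crawled.get(url)
    return previous is None or now - previous >= RECRAWL_AFTER
```

Passing `now` in explicitly, instead of calling `datetime.now()` inside the function, keeps the check easy to test and lets one crawl run use a single consistent cutoff.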

    As for questions 3 and 4, I don’t understand them.
