The Archive Base


Editorial Team
Asked: May 27, 2026

What is a typical politeness factor for a web crawler?

That is, apart from always obeying robots.txt, both the "Disallow:" directive and the non-standard "Crawl-delay:" directive.

If a site does not specify an explicit crawl delay, what should the default value be?


1 Answer

  1. Editorial Team
    Added an answer on May 27, 2026 at 9:11 pm

    The algorithm we use is:

    // If we are blocked by robots.txt,
    // make sure it is obeyed.
    // Our bot's user-agent string contains a link to an HTML page explaining this.
    // It also gives an email address to contact in order to be added to a list of domains we never even consider in the future.
    
    // If we receive more than 5 consecutive responses with an HTTP response code of 500+ (or timeouts),
    // then we assume either that the domain is under heavy load and does not need us adding to it,
    // or that the URLs we are crawling are completely wrong and causing problems.
    // Either way, we suspend crawling of this domain for 4 hours.
    
    // There is a non-standard parameter in robots.txt that defines a min crawl delay
    // If it exists then obey it.
    //
    //    see: http://www.searchtools.com/robots/robots-txt-elements.html
    double PolitenssFromRobotsTxt = getRobotPolitness();
    
    
    // Work-size politeness
    // Large, popular domains are designed to handle load, so we can use a
    // smaller delay on these sites than for smaller domains (thus small domains hosted by
    // mom-and-pop shops on the family PC under the desk in the office are crawled slowly).
    //
    // But the max delay here is 5 seconds:
    //
    //    domainSize => Range 0 -> 10
    //
    double workSizeTime = std::min(exp(2.52166863221 + -0.530185027289 * log(domainSize)), 5.0);
    //
    // You can find out how important we think your site is here:
    //      http://www.opensiteexplorer.org
    // Look at the Domain Authority and divide by 10.
    // Note: This is not exactly the number we use, but the two numbers are highly correlated,
    //       so it will usually give you a fair indication.
    
    
    
    // Take into account the response time of the last request.
    // If the server is under heavy load and taking a long time to respond
    // then we slow down the requests. Note time-outs are handled above
    double responseTime = pow(0.203137637588 + 0.724386103344 * lastResponseTime, 2);
    
    // Use the slower of the calculated times
    double result = std::max(workSizeTime, responseTime);
    
    // Never faster than the crawl-delay directive
    result = std::max(result, PolitenssFromRobotsTxt);
    
    
    // Set a minimum delay,
    // so we never hit a site more often than once every tenth of a second.
    result = std::max(result, 0.1);
    
    // The maximum delay we allow is 2 minutes.
    result = std::min(result, 120.0);
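
    For anyone who wants to try the numbers, the steps above can be collected into a self-contained sketch. This is only an illustration under assumptions, not our production code: the names DomainStats, computeCrawlDelay and FailureTracker are made up for this example, the Crawl-delay value is assumed to be already parsed into seconds (0 when absent), and resetting the failure counter after a suspension is my own choice.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical per-domain inputs; the field names are illustrative only.
struct DomainStats {
    double robotsCrawlDelay;  // seconds from Crawl-delay, 0 if the site sets none
    double domainSize;        // popularity score in the range 0..10
    double lastResponseTime;  // seconds the previous request took
};

// Returns the number of seconds to wait before the next request to the domain.
double computeCrawlDelay(const DomainStats& s) {
    // Popular domains get a shorter delay; the result is capped at 5 seconds.
    const double workSizeTime = std::min(
        std::exp(2.52166863221 - 0.530185027289 * std::log(s.domainSize)), 5.0);

    // Servers that answered slowly get a longer delay.
    const double responseTime =
        std::pow(0.203137637588 + 0.724386103344 * s.lastResponseTime, 2);

    // Use the slowest of the three candidates, then clamp to [0.1, 120] seconds.
    double result = std::max(workSizeTime, responseTime);
    result = std::max(result, s.robotsCrawlDelay);
    result = std::max(result, 0.1);
    return std::min(result, 120.0);
}

// Tracks consecutive failures (HTTP 500+ or timeouts) for one domain.
struct FailureTracker {
    int consecutiveFailures = 0;

    // Records one response and returns how long to suspend crawling,
    // in seconds (0 means keep crawling).
    double record(int httpStatus, bool timedOut) {
        if (timedOut || httpStatus >= 500) {
            ++consecutiveFailures;
        } else {
            consecutiveFailures = 0;
        }
        if (consecutiveFailures > 5) {   // more than 5 in a row
            consecutiveFailures = 0;     // assumption: start counting afresh
            return 4.0 * 60.0 * 60.0;    // suspend for 4 hours
        }
        return 0.0;
    }
};
```

    With these constants, a site with domainSize 10 and a fast server comes out at roughly 3.7 seconds between requests, and anything with a domainSize below about 5.6 sits at the 5 second cap unless the response time or Crawl-delay pushes it higher.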
    

© 2021 The Archive Base. All Rights Reserved