The Archive Base

Editorial Team
Asked: May 20, 2026

I have been reading up on web crawling and have collected a long list of considerations, but there is one concern I have not seen discussed anywhere.

How often should robots.txt be fetched for any given site?

My scenario is a very slow crawl of any specific site, maybe 100 pages a day.
Let's say a website adds a new section (/humans-only/) that other pages link to, and at the same time adds the appropriate line to robots.txt. A spider might find links to this section before it refreshes its copy of robots.txt.

Funny how writing down a problem gives the solution.
When formulating my question above I got an idea of a solution.

The robots.txt file can be refreshed rarely, say once a day.
But all newly found links should be held in a queue until the next refresh of robots.txt. After robots.txt has been refreshed, all pending links that pass can be crawled.
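
A minimal sketch of that holding queue, in Python. The class name `PendingLinks` and the `is_allowed` callback are hypothetical names for illustration; the callback stands in for whatever check runs against the freshly downloaded robots.txt.

```python
from collections import deque


class PendingLinks:
    """Holds newly discovered links until robots.txt has been refreshed."""

    def __init__(self):
        self._pending = deque()

    def discover(self, url):
        # New links are parked here instead of being crawled right away.
        self._pending.append(url)

    def release(self, is_allowed):
        """Call after a robots.txt refresh; returns the links that may be crawled.

        Links that fail the check are dropped along with the rest of the queue.
        """
        ready = [url for url in self._pending if is_allowed(url)]
        self._pending.clear()
        return ready
```

With this shape, a link waits for up to one full refresh interval before it can be fetched.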

Any other ideas or practical experience with this?

1 Answer

  1. Editorial Team
    Added an answer on May 20, 2026 at 2:42 pm

    All large-scale Web crawlers cache robots.txt for some period of time. One day is pretty common, and in the past I’ve seen times as long as a week. Our crawler has a maximum cache time of 24 hours. In practice, it’s typically less than that except for sites that we crawl very often.
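
As a rough illustration of that kind of caching (not the code of any particular crawler), the sketch below keeps one parsed robots.txt per host and refetches it once it is older than a maximum age. It uses the standard library's `urllib.robotparser`. The `RobotsCache` class, the `fetch` callable, and the `clock` parameter are assumptions introduced for the example.

```python
import time
from urllib import robotparser
from urllib.parse import urlsplit

MAX_CACHE_SECONDS = 24 * 60 * 60  # the 24-hour ceiling mentioned above


class RobotsCache:
    """Caches one parsed robots.txt per host and refetches it once it expires."""

    def __init__(self, fetch, max_age=MAX_CACHE_SECONDS, clock=time.time):
        self._fetch = fetch      # hypothetical callable: robots.txt URL -> text
        self._max_age = max_age
        self._clock = clock      # injectable so expiry can be tested
        self._entries = {}       # host -> (fetched_at, parser)

    def _parser_for(self, url):
        parts = urlsplit(url)
        host = parts.netloc
        now = self._clock()
        entry = self._entries.get(host)
        if entry is None or now - entry[0] >= self._max_age:
            # Missing or stale: download and parse a fresh copy.
            parser = robotparser.RobotFileParser()
            text = self._fetch(f"{parts.scheme}://{host}/robots.txt")
            parser.parse(text.splitlines())
            entry = (now, parser)
            self._entries[host] = entry
        return entry[1]

    def allowed(self, user_agent, url):
        """Check a URL against the cached rules for its host."""
        return self._parser_for(url).can_fetch(user_agent, url)
```

A crawler would call `allowed()` just before each fetch, so the decision is always made with the newest copy it has on hand.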

    If you hold links to wait for a future version of robots.txt, then you’re adding an artificial 24-hour latency to your crawl. That is, if you crawl my site today then you have to hold all those links for up to 24 hours before you download my robots.txt file again and verify that the links you crawled were allowed at the time. And you could be wrong as often as you’re right. Let’s say the following happens:

    2011-03-08 06:00:00 - You download my robots.txt
    2011-03-08 08:00:00 - You crawl the /humans-only/ directory on my site
    2011-03-08 22:00:00 - I change my robots.txt to restrict crawlers from accessing /humans-only/
    2011-03-09 06:30:00 - You download my robots.txt and throw out the /humans-only/ links.
    

    At the time you crawled, you were allowed to access that directory, so there was no problem with you publishing the links.

    You could use the last modified date returned by the Web server when you download robots.txt to determine if you were allowed to read those files at the time, but a lot of servers lie when returning the last modified date. Some large percentage (I don’t remember what it is) always return the current date/time as the last modified date because all of their content, including robots.txt, is generated at access time.
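
One way to spot such servers, sketched under assumptions: treat a Last-Modified value that is missing, unparseable, or within a few seconds of the fetch time as untrustworthy. The function name and the five-second tolerance are made up for the example; only `email.utils.parsedate_to_datetime` comes from the standard library.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def last_modified_looks_generated(last_modified, fetched_at,
                                  tolerance=timedelta(seconds=5)):
    """Return True if the Last-Modified header cannot be trusted.

    That covers a missing or unparseable header, and a timestamp so close
    to the fetch time that the file was probably generated on access.
    """
    if not last_modified:
        return True
    try:
        modified = parsedate_to_datetime(last_modified)
    except (TypeError, ValueError):
        return True
    if modified.tzinfo is None:
        # Assumption: treat a header without a usable zone as UTC.
        modified = modified.replace(tzinfo=timezone.utc)
    return abs(fetched_at - modified) <= tolerance
```

Even when the header passes this check, it only says when the file last changed, so it is a hint and not proof of what the rules were when a page was crawled.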

    Also, adding that restriction to your bot means that you’ll have to visit their robots.txt file again even if you don’t intend to crawl their site. Otherwise, links will languish in your cache. Your proposed technique raises a lot of issues that you can’t handle gracefully. Your best bet is to operate with the information you have at hand.

    Most site operators understand robots.txt caching and will look the other way if your bot hits a restricted directory on their site within 24 hours of a robots.txt change, provided, of course, that you didn't read the updated robots.txt and then go ahead and crawl the restricted pages anyway. Of the few who question the behavior, a simple explanation of what happened is usually sufficient.

    As long as you're open about what your crawler is doing, and you provide a way for site operators to contact you, most misunderstandings are easily corrected. There are a few people, a very few, who will accuse you of all kinds of nefarious activities. Your best bet with them is to apologize for causing a problem and then block your bot from ever visiting their sites.
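
Both habits can be sketched in a few lines. The bot name, info URL, contact address, and blocked host below are placeholders, and `may_visit` is a hypothetical helper, not part of any library.

```python
from urllib.parse import urlsplit

# Placeholder values: substitute your own bot name, info page, and contact address.
USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot-info; bot-admin@example.com)"

# Hosts whose operators asked never to be crawled again.
BLOCKED_HOSTS = {"complainer.example"}


def may_visit(url, blocked_hosts=BLOCKED_HOSTS):
    """Return False for any URL on a host the crawler has been told to avoid."""
    host = (urlsplit(url).hostname or "").lower()
    return host not in blocked_hosts
```

Sending `USER_AGENT` with every request gives operators a way to reach you, and checking `may_visit()` before queueing a URL keeps the block permanent.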
