The Archive Base

Editorial Team
Asked: May 21, 2026

I have come across an interview question If you were designing a web crawler,

I came across the interview question “If you were designing a web crawler, how would you avoid getting into infinite loops?” and I am trying to answer it.

How does it all work, starting from the beginning?
Say Google started with some hub pages, hundreds of them (how those hub pages were found in the first place is a separate sub-question).
As Google follows links from one page to the next, does it keep a hash table of visited pages so that it doesn’t follow links to pages it has already visited?

What if the same page has two names (URLs), as happens these days with URL shorteners and the like?

I have taken Google as an example. Google doesn’t disclose how its crawling and page-ranking algorithms work, but any guesses?

1 Answer

  1. Editorial Team
    Added an answer on May 21, 2026 at 5:23 pm

    If you want a detailed answer, take a look at section 3.8 of this paper, which describes the URL-seen test of the Mercator crawler:

    In the course of extracting links, any
    Web crawler will encounter multiple
    links to the same document. To avoid
    downloading and processing a document
    multiple times, a URL-seen test must
    be performed on each extracted link
    before adding it to the URL frontier.
    (An alternative design would be to
    instead perform the URL-seen test when
    the URL is removed from the frontier,
    but this approach would result in a
    much larger frontier.)

    To perform the
    URL-seen test, we store all of the
    URLs seen by Mercator in canonical
    form in a large table called the URL
    set. Again, there are too many entries
    for them all to fit in memory, so like
    the document fingerprint set, the URL
    set is stored mostly on disk.

    To save
    space, we do not store the textual
    representation of each URL in the URL
    set, but rather a fixed-sized
    checksum. Unlike the fingerprints
    presented to the content-seen test’s
    document fingerprint set, the stream
    of URLs tested against the URL set has
    a non-trivial amount of locality. To
    reduce the number of operations on the
    backing disk file, we therefore keep
    an in-memory cache of popular URLs.
    The intuition for this cache is that
    links to some URLs are quite common,
    so caching the popular ones in memory
    will lead to a high in-memory hit
    rate.

    In fact, using an in-memory
    cache of 2^18 entries and the LRU-like
    clock replacement policy, we achieve
    an overall hit rate on the in-memory
    cache of 66.2%, and a hit rate of 9.5%
    on the table of recently-added URLs,
    for a net hit rate of 75.7%. Moreover,
    of the 24.3% of requests that miss in
    both the cache of popular URLs and the
    table of recently-added URLs, about
    1/3 produce hits on the buffer in our
    random access file implementation,
    which also resides in user-space. The
    net result of all this buffering is
    that each membership test we perform
    on the URL set results in an average
    of 0.16 seek and 0.17 read kernel
    calls (some fraction of which are
    served out of the kernel’s file system
    buffers). So, each URL set membership
    test induces one-sixth as many kernel
    calls as a membership test on the
    document fingerprint set. These
    savings are purely due to the amount
    of URL locality (i.e., repetition of
    popular URLs) inherent in the stream
    of URLs encountered during a crawl.

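    Basically, they reduce every canonicalized URL to a fixed-size checksum (a hash) and test each new link against the set of checksums already seen. A hash cannot guarantee a unique value for every URL, but with a good function and enough bits, collisions are rare enough to ignore in practice. Because popular URLs recur so often in the stream, a small in-memory cache answers most lookups without touching the disk. Google has also open-sourced a fast family of string hash functions: CityHash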
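The URL-seen test described above can be sketched roughly as follows. This is a minimal illustration, not Mercator's implementation: the names `canonicalize`, `checksum` and `UrlSeenTest` are made up for the example, the canonicalization rules are a small illustrative subset, an in-process `set` stands in for the on-disk URL set, the 8-byte BLAKE2b digest is an arbitrary choice of checksum, and a plain LRU cache replaces the clock policy the paper mentions.

```python
import hashlib
from collections import OrderedDict
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url: str) -> str:
    """Reduce trivially different spellings of a URL to one form (illustrative rules only)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port is not None and parts.port != default_port:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, host, path, parts.query, ""))


def checksum(url: str) -> bytes:
    """Fixed-size (8-byte) checksum of the canonical URL; collisions are unlikely, not impossible."""
    return hashlib.blake2b(canonicalize(url).encode("utf-8"), digest_size=8).digest()


class UrlSeenTest:
    """Hypothetical URL-seen test: a checksum set plus a small LRU cache in front of it."""

    def __init__(self, cache_size: int = 2 ** 18):
        self._store = set()          # assumption: stand-in for the on-disk URL set
        self._cache = OrderedDict()  # LRU cache of recently tested checksums
        self._cache_size = cache_size
        self.cache_hits = 0

    def seen_before(self, url: str) -> bool:
        """Return True if url was already recorded; otherwise record it and return False."""
        key = checksum(url)
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            self.cache_hits += 1
            return True
        seen = key in self._store         # the "slow" lookup in a real crawler
        self._store.add(key)
        self._cache[key] = True
        if len(self._cache) > self._cache_size:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return seen
```

A crawler would call `seen_before(link)` on every extracted link and add the link to the frontier only when it returns `False`. Canonicalization of this kind only catches different spellings of the same URL. A shortened URL resolves to its target only after following the redirect, and the same document under two unrelated URLs is what the content-seen test (the document fingerprint set in the quote) is for.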

    WARNING!
    They might also be talking about bot traps! A bot trap is a section of a site that keeps generating new links with unique URLs, so by following the links it serves you essentially get trapped in an "infinite loop". This is not exactly a loop, because a loop would be the result of visiting the same URL again, but it is an infinite chain of URLs that you should avoid crawling.

    Update 12/13/2012– the day after the world was supposed to end 🙂

    Per Fr0zenFyr’s comment: if one uses the AOPIC algorithm for selecting pages, then it’s fairly easy to avoid bot-traps of the infinite loop kind. Here is a summary of how AOPIC works:

    1. Get a set of N seed pages.
    2. Allocate a total of X credits across the pages, so that each page has X/N credits (i.e. an equal amount) before crawling starts.
    3. Select the page P with the highest amount of credit (if all pages have the same amount, pick one at random).
    4. Crawl page P (let’s say that P had 100 credits when it was crawled).
    5. Extract all the links from page P (let’s say there are 10 of them).
    6. Set the credits of P to 0.
    7. Take a 10% "tax" from P’s original credits and allocate it to a Lambda page.
    8. Divide what remains of P’s original credits after the tax equally among the links found on page P: (100 (P credits) – 10 (10% tax))/10 (links) = 9 credits per link.
    9. Repeat from step 3.

    Since the Lambda page continuously collects tax, eventually it will be the page with the largest amount of credit and we’ll have to "crawl" it. I say "crawl" in quotes, because we don’t actually make an HTTP request for the Lambda page, we just take its credits and distribute them equally to all of the pages in our database.

    Since bot traps only give credits to their own internal links and rarely receive credits from outside, they continually leak credits (through taxation) to the Lambda page. The Lambda page distributes those credits evenly to all of the pages in the database, so on each cycle the bot-trap pages lose more and more credits, until they have so few that they almost never get crawled again. This does not happen to good pages, because they regularly receive credits from back-links found on other pages. It also gives you a dynamic page rank: if you take a snapshot of your database at any time and order the pages by the amount of credits they have, they will most likely be ordered roughly according to their true page rank.
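The credit bookkeeping in the steps above can be sketched like this. It is a toy model under stated assumptions, not a reference AOPIC implementation: `AopicScheduler`, `next_page` and `visit` are made-up names, credits live in an in-memory dict instead of a database, and a page with no outgoing links hands its remaining credit to the Lambda page (the summary above does not say what happens in that case).

```python
import random


class AopicScheduler:
    """Toy credit ledger following the AOPIC summary above (hypothetical helper)."""

    LAMBDA = "__lambda__"  # virtual page: never fetched over HTTP

    def __init__(self, seeds, total_credit=1000.0, tax_rate=0.10):
        self.tax_rate = tax_rate
        share = total_credit / len(seeds)          # step 2: X/N credits per seed
        self.credit = {url: share for url in seeds}
        self.credit[self.LAMBDA] = 0.0

    def next_page(self):
        """Step 3: pick the page with the most credit, breaking ties at random."""
        best = max(self.credit.values())
        candidates = [url for url, c in self.credit.items() if c == best]
        return random.choice(candidates)

    def visit(self, page, links=()):
        """Steps 6 to 8: zero the page's credit, tax it, and share the rest among its links."""
        amount = self.credit[page]
        self.credit[page] = 0.0
        if page == self.LAMBDA:
            # "Crawling" Lambda: spread its credit evenly over every known real page.
            targets = [url for url in self.credit if url != self.LAMBDA]
            payout = amount
        else:
            tax = amount * self.tax_rate
            self.credit[self.LAMBDA] += tax
            targets = list(links)
            payout = amount - tax
            if not targets:
                # Assumption: a dead-end page gives the remainder to Lambda as well,
                # so the total amount of credit in the system stays constant.
                self.credit[self.LAMBDA] += payout
                return
        for url in targets:
            self.credit[url] = self.credit.get(url, 0.0) + payout / len(targets)
```

A crawl loop would repeat `page = scheduler.next_page()` followed by `scheduler.visit(page, links)`, where `links` comes from fetching and parsing `page` (or is omitted when `page` is the Lambda page). With a single seed holding 100 credits and 10 extracted links, one `visit` leaves 10 credits on the Lambda page and 9 on each link, matching the arithmetic in step 8.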

    This only avoids bot traps of the infinite-loop kind; there are many other kinds of bot trap to watch out for, and there are ways to get around them too.
