I have come across an interview question If you were designing a web crawler,

Question


Asked: May 21, 2026


I came across the interview question “If you were designing a web crawler, how would you avoid getting into infinite loops?” and I am trying to answer it.

How does it all begin in the first place?
Say Google started with some hub pages, perhaps hundreds of them (how these hub pages were found in the first place is a separate sub-question).
As Google follows links from one page to the next, does it maintain a hash table to make sure that it doesn’t revisit pages it has already visited?

What if the same page has two names (URLs), as happens these days with URL shorteners and the like?

I have taken Google as an example. Google doesn’t disclose how its web crawler and page-ranking algorithms work, but does anyone have any guesses?


1 Answer

Editorial Team · Answer 1 · 2026-05-21T17:23:17+00:00

If you want a detailed answer, take a look at section 3.8 of this paper, which describes the URL-seen test of a modern web crawler (Mercator):

In the course of extracting links, any
Web crawler will encounter multiple
links to the same document. To avoid
downloading and processing a document
multiple times, a URL-seen test must
be performed on each extracted link
before adding it to the URL frontier.
(An alternative design would be to
instead perform the URL-seen test when
the URL is removed from the frontier,
but this approach would result in a
much larger frontier.)

To perform the
URL-seen test, we store all of the
URLs seen by Mercator in canonical
form in a large table called the URL
set. Again, there are too many entries
for them all to fit in memory, so like
the document fingerprint set, the URL
set is stored mostly on disk.

To save
space, we do not store the textual
representation of each URL in the URL
set, but rather a fixed-sized
checksum. Unlike the fingerprints
presented to the content-seen test’s
document fingerprint set, the stream
of URLs tested against the URL set has
a non-trivial amount of locality. To
reduce the number of operations on the
backing disk file, we therefore keep
an in-memory cache of popular URLs.
The intuition for this cache is that
links to some URLs are quite common,
so caching the popular ones in memory
will lead to a high in-memory hit
rate.

In fact, using an in-memory
cache of 2^18 entries and the LRU-like
clock replacement policy, we achieve
an overall hit rate on the in-memory
cache of 66.2%, and a hit rate of 9.5%
on the table of recently-added URLs,
for a net hit rate of 75.7%. Moreover,
of the 24.3% of requests that miss in
both the cache of popular URLs and the
table of recently-added URLs, about
1/3 produce hits on the buffer in our
random access file implementation,
which also resides in user-space. The
net result of all this buffering is
that each membership test we perform
on the URL set results in an average
of 0.16 seek and 0.17 read kernel
calls (some fraction of which are
served out of the kernel’s file system
buffers). So, each URL set membership
test induces one-sixth as many kernel
calls as a membership test on the
document fingerprint set. These
savings are purely due to the amount
of URL locality (i.e., repetition of
popular URLs) inherent in the stream
of URLs encountered during a crawl.

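Basically, they put every URL into canonical form and store a fixed-size checksum of it instead of the full text. A fixed-size checksum cannot strictly guarantee a unique value for every URL, but with a good hash function and enough bits, collisions are rare enough to ignore. Because the stream of URLs has a lot of locality (popular URLs repeat often), most lookups are answered from an in-memory cache rather than from disk. Google has even open-sourced a family of fast string hash functions: CityHash.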
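To make the URL-seen test concrete, here is a minimal Python sketch of the idea. It is not Mercator's implementation: the canonicalization rules, the `UrlSeenTest` class, the 8-byte BLAKE2b checksum, and the plain LRU cache (the paper uses an LRU-like clock policy) are all assumptions chosen for illustration, and a Python `set` stands in for the on-disk URL set.

```python
import hashlib
from collections import OrderedDict
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url: str) -> str:
    """Reduce a URL to a simple canonical form (illustrative rules only)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the scheme's default.
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port is not None and parts.port != default_port:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, host, path, parts.query, ""))


def checksum(url: str) -> bytes:
    """Fixed-size (8-byte) checksum of the canonical URL."""
    data = canonicalize(url).encode("utf-8")
    return hashlib.blake2b(data, digest_size=8).digest()


class UrlSeenTest:
    """Hypothetical URL-seen test: small LRU cache in front of a big set."""

    def __init__(self, cache_size: int = 2**18):
        self.cache_size = cache_size
        self.cache = OrderedDict()  # in-memory cache of recently used checksums
        self.url_set = set()        # stand-in for the mostly-on-disk URL set
        self.cache_hits = 0

    def seen_before(self, url: str) -> bool:
        """Return True if the URL was seen already; otherwise record it."""
        key = checksum(url)
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as most recently used
            self.cache_hits += 1
            return True
        found = key in self.url_set      # a disk lookup in a real crawler
        self.url_set.add(key)
        self.cache[key] = None
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the least recently used
        return found
```

A crawler would call `seen_before(link)` on every extracted link and add the link to the frontier only when it returns `False`. Note that canonicalization only catches aliases that differ syntactically; two different URLs that serve the same document (for example through a URL shortener) need the content-seen test that the quoted passage also mentions.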

WARNING!
The interviewer might also be asking about bot traps. A bot trap is a section of a site that keeps generating new links with unique URLs, so a crawler that follows every link it is served gets stuck in what is effectively an "infinite loop". Strictly speaking this is not a loop, because a loop would mean visiting the same URL again; it is an infinite chain of distinct URLs, which the URL-seen test cannot detect and which you should avoid crawling.

Update 12/13/2012 - the day after the world was supposed to end 🙂

Per Fr0zenFyr’s comment: if one uses the AOPIC algorithm for selecting pages, then it’s fairly easy to avoid bot-traps of the infinite loop kind. Here is a summary of how AOPIC works:

1. Get a set of N seed pages.
2. Split a total of X credits equally among the seed pages, so that each page has X/N credits before crawling starts.
3. Select the page P with the highest amount of credit (if several pages are tied, pick one of them at random).
4. Crawl page P (let’s say that P had 100 credits when it was crawled).
5. Extract all the links from page P (let’s say there are 10 of them).
6. Set the credits of P to 0.
7. Take a 10% "tax" and allocate it to a Lambda page.
8. Divide the rest of P’s original credits equally among the links found on page P: (100 credits - 10 tax) / 10 links = 9 credits per link.
9. Repeat from step 3.

Since the Lambda page continuously collects tax, eventually it will be the page with the largest amount of credit and we’ll have to "crawl" it. I say "crawl" in quotes, because we don’t actually make an HTTP request for the Lambda page, we just take its credits and distribute them equally to all of the pages in our database.
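The steps above, including the Lambda page, can be sketched in a few lines of Python. This is only an illustration of the summary given here, not a reference implementation of AOPIC: the function name `aopic_step`, the in-memory `credits` and `links` dictionaries, and the handling of pages without any links are assumptions, and the `links` dictionary stands in for actually fetching and parsing a page.

```python
import random

LAMBDA = "<lambda>"  # virtual page: collects the tax and is never fetched


def aopic_step(credits, links, tax_rate=0.10, rng=random):
    """Run one selection step in place and return the selected page.

    credits: dict mapping every known page (plus LAMBDA) to its credit
    links:   dict mapping a page to the list of pages it links to
    """
    # Step 3: pick the page with the most credit, breaking ties at random.
    top = max(credits.values())
    page = rng.choice([p for p, c in credits.items() if c == top])
    amount = credits[page]
    credits[page] = 0.0  # step 6

    if page == LAMBDA:
        # "Crawling" Lambda: no HTTP request, just spread its credit evenly.
        real_pages = [p for p in credits if p != LAMBDA]
        for p in real_pages:
            credits[p] += amount / len(real_pages)
        return page

    tax = amount * tax_rate  # step 7
    credits[LAMBDA] += tax
    outlinks = links.get(page, [])
    if outlinks:
        share = (amount - tax) / len(outlinks)  # step 8
        for target in outlinks:
            credits[target] = credits.get(target, 0.0) + share
    else:
        # Assumption: a page without links hands the remainder to Lambda.
        credits[LAMBDA] += amount - tax
    return page
```

Running one step on the worked example from the list (a page with 100 credits and 10 links) leaves that page with 0 credits, gives Lambda 10, and gives each linked page 9, so the total amount of credit in the system stays at 100.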

Since bot traps only give credits to their own internal links and rarely receive credits from outside, they continually leak credits (through taxation) to the Lambda page. The Lambda page distributes those credits evenly to all of the pages in the database, so with each cycle the bot trap pages lose more and more credits, until they have so few that they almost never get crawled again. This does not happen to good pages, because they often receive credits from back-links found on other pages. The scheme also yields a dynamic page rank: any time you take a snapshot of your database and order the pages by the amount of credit they have, they will most likely be ordered roughly according to their true page rank.
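As a rough worked example of that leak, consider the simplest kind of trap: a chain in which every page links to exactly one new page. This is a simplification that ignores the small share the Lambda page hands back; the symbols $c_0$ (the credit that enters the trap), $t$ (the tax rate), and $k$ (the number of trap pages crawled so far) are introduced here only for illustration.

```latex
c_k = c_0 \, (1 - t)^k
\qquad\text{e.g. } c_0 = 100,\ t = 0.1:\quad
c_{10} = 100 \cdot 0.9^{10} \approx 34.9,
\qquad
c_{50} = 100 \cdot 0.9^{50} \approx 0.5
```

Under these assumptions the credit carried along the chain shrinks geometrically, while every page outside the trap keeps receiving an even share of what Lambda collects.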

This only avoids bot traps of the infinite-loop kind; there are many other kinds of bot traps that you should watch out for, and there are ways to get around them too.
