I’m working on a crawler and need to understand exactly what is meant by

Question

0

Asked: May 18, 20262026-05-18T09:41:48+00:00 2026-05-18T09:41:48+00:00

I’m working on a crawler and need to understand exactly what is meant by

0

I’m working on a crawler and need to understand exactly what is meant by “link depth”. Take nutch for example: http://wiki.apache.org/nutch/NutchTutorial

depth indicates the link depth from the root page that should be
crawled.

So, say I have the domain www.domain.com and wanted to crawl a depth of, say, 3 — what do I need to do? If a site could be represented as a binary tree, then it wouldn’t be a problem I think.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-18T09:41:49+00:00

Link depth means the number of “hops” a page is be away from the root, where a “hop” means following a link on a page. The reason Nutch has this restriction is that links very “far away” from the main page are unlikely to hold much information (the main page will link to the most important information, so the farther you get, the more detailed info you find), while there can be very many of them, so they take up lots of storage space, computing time for ranking, and bandwidth.

Nutch thus uses an algorithm scheme known as depth-limited search to bound its running time and space usage. If it didn’t use this heuristic, it would have to crawl an entire site to rank all the pages in it and find the top N.

To crawl to depth 3, implement this algorithm and give it a depth bound of three. The nice thing about depth-limited search is that it’s a variant of depth-first search (DFS), so it’s quite space-efficient:

function depth-limited-crawl(page p, int d)
    if d == 0
        return
    /* do something with p, store it or so */
    foreach (page l in links(p))
        depth-limited-crawl(linked, d-1)

And no, a site cannot in general be represented as a binary tree; it’s a directed graph. If you somehow remove the backlinks, then it becomes a multiway tree. Either way, many sites are too large to store for your crawler.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I’m working on a crawler and need to understand exactly what is meant by

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply