Hello guys! I am going to write a website crawler, which starts from the root address


Asked: June 3, 2026 (2026-06-03T13:59:23+00:00)


Hello guys! I am going to write a website crawler that starts from a root address and then crawls every link it finds (internal links only). I am facing this problem:
The crawler must start from the root, parse the root page, and collect all of its links. While following those links, it should not crawl the same page twice. Is there a good data structure for this, or do I need to use SQL or some other indexing data structure?


1 Answer

Editorial Team · Answer 1 · 2026-06-03T13:59:24+00:00

I’d recommend checking out my answers here: “Designing a web crawler” and “How can I bring google-like recrawling in my application (web or console)”.

I’ve answered a lot of the questions you’re asking there. The key thing to walk away with is that crawlers use a URL-Seen Test to efficiently determine whether they should crawl a link. The URL-Seen Test is usually implemented with a map structure that quickly resolves keys (URLs). Commonly used solutions are embedded databases such as LevelDB and Berkeley DB, and other NoSQL solutions.
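To make the URL-Seen Test concrete, here is a minimal sketch in Python, assuming the whole crawl fits in memory so that a plain `set` can play the role of the map. The names `LinkParser`, `normalize`, `fetch`, `crawl`, and the `max_pages` limit are hypothetical, chosen only for illustration. It ignores error handling, robots.txt, and politeness delays.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlsplit
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def normalize(url):
    """Drop the #fragment so that page#a and page#b count as one page."""
    return urldefrag(url).url


def fetch(url):
    """Hypothetical default fetcher: download a page and return it as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(root, fetch_page=fetch, max_pages=100):
    """Breadth-first crawl of internal links, visiting each URL at most once."""
    root = normalize(root)
    host = urlsplit(root).netloc
    seen = {root}            # the URL-seen test: average O(1) membership check
    queue = deque([root])    # frontier of URLs still to be fetched
    order = []               # pages in the order they were crawled
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        parser = LinkParser()
        parser.feed(fetch_page(url))
        for href in parser.links:
            link = normalize(urljoin(url, href))
            if urlsplit(link).netloc != host:
                continue     # external link (or mailto: etc.): skip it
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

A URL is added to `seen` at the moment it is queued, not when it is fetched, so the same link found on several pages is queued only once. For a site too large to track in memory, the `set` could be swapped for a disk-backed key-value store of the kind mentioned above, with the same "check, then add" logic.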
