Hello guys! I am going to write a website crawler that starts from a root address and then crawls every link it finds (internal links only). I am facing this problem:
The crawler must start from the root, parse the root page, and collect all of its links. While collecting links, it should not crawl the same page twice. Is there a good data structure for this, or do I need to use SQL or some other indexing data structure?
I’d recommend checking out my answers here: "Designing a web crawler" and "How can I bring google-like recrawling in my application (web or console)".
I’ve answered a lot of the questions you’re asking. The key thing to walk away with is that crawlers use a URL-Seen Test to efficiently determine whether they should crawl a link. The URL-Seen Test is usually implemented using a map structure that quickly resolves keys (URLs). Commonly used solutions are embedded databases such as LevelDB and Berkeley DB, and other NoSQL solutions.
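To make the URL-Seen Test concrete, here is a minimal sketch in Python. It assumes the whole set of seen URLs fits in memory, so a plain `set` stands in for the map structure. The names `crawl`, `normalize`, `LinkParser`, and the in-memory `SITE` dictionary are all made up for illustration; a real crawler would pass a `fetch` function that performs HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def normalize(base, href):
    """Resolve href against the page URL and drop any #fragment."""
    url, _fragment = urldefrag(urljoin(base, href))
    return url


def crawl(root, fetch):
    """Breadth-first crawl of internal links; returns URLs in visit order.

    fetch is a caller-supplied function: url -> HTML string, or None on failure.
    """
    host = urlparse(root).netloc
    seen = {root}          # the URL-Seen Test: average O(1) membership checks
    queue = deque([root])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        html = fetch(url)
        if html is None:
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = normalize(url, href)
            if urlparse(link).netloc != host:
                continue   # external link, skip it
            if link in seen:
                continue   # already queued or crawled
            seen.add(link)
            queue.append(link)
    return order


# Hypothetical in-memory "site" so the sketch runs without network access.
SITE = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a> <a href="http://other.org/">x</a>',
    "http://example.com/a": '<a href="/">home</a> <a href="/b#top">B</a>',
    "http://example.com/b": '<a href="/a">A</a> <a href="c">C</a>',
    "http://example.com/c": "",
}

if __name__ == "__main__":
    for page in crawl("http://example.com/", SITE.get):
        print(page)
```

A URL is marked as seen when it is queued rather than when it is fetched, so the same page never enters the queue twice. If the site is too large for memory, the `set` could be replaced by an on-disk key-value store of the kind mentioned above, keeping the same "check, then add" logic.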