
The Archive Base


The Archive Base Latest Questions

Editorial Team
Asked: June 8, 2026

I am eager to know (and need to know) about Nutch and its algorithms

I am eager to know (and need to know) about Nutch and the algorithms it uses to fetch, classify, … (crawling in general), because it relates to my project.
I read this material, but it is a little hard to understand.
Is there anyone who can explain this to me in a complete and easy-to-understand way?
Thanks in advance.

1 Answer

  1. Editorial Team
    Added an answer on June 8, 2026 at 6:40 pm

    Short Answer

    In short, the Nutch developers have built a web crawler designed to crawl the web very efficiently from a many-computer environment (it can also be run on a single computer).

    You can start crawling the web without actually needing to know how they implemented it.

    The page you reference describes how it is implemented.

    Technology behind it

    Nutch makes use of Hadoop, an open-source Java project that implements the MapReduce programming model. MapReduce is the technique Google introduced for processing very large data sets, such as a crawl of the web, across many machines.

    I’ve attended a few lectures on MapReduce/Hadoop, and unfortunately, I don’t know if anyone at this time can explain it in a complete and easy-to-understand way (they’re kind of opposites).

    Take a look at the Wikipedia page for MapReduce.

    The basic idea is to send a job to the Master Node. The Master breaks the work up into pieces and sends them (maps them) to the various Worker Nodes (other computers or threads), which perform their assigned sub-tasks and send the sub-results back to the Master.

    Once the Master Node has all (or some) of the sub-results, it starts to combine them (reduce them) into the final answer. In Hadoop itself the reduce work is also handed out to worker nodes, but the effect is the same.

    These tasks run in parallel, and each computer is given the right amount of work to keep it occupied the whole time.
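
The master/worker idea above can be sketched on a single machine, with threads standing in for worker nodes. This is only an illustration of the pattern, not Hadoop code: the names `master` and `count_words` and the word-count task are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """Worker sub-task: count the words in one chunk of lines."""
    counts = {}
    for line in chunk:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def master(lines, num_workers=2):
    """Split the job, hand the pieces to workers, then combine the sub-results."""
    # Break the job into roughly equal pieces, one per worker.
    chunks = [lines[i::num_workers] for i in range(num_workers)]
    # "Map": every worker processes its own chunk in parallel.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        sub_results = list(pool.map(count_words, chunks))
    # "Reduce": merge the partial counts into the final answer.
    total = {}
    for sub in sub_results:
        for word, n in sub.items():
            total[word] = total.get(word, 0) + n
    return total


result = master(["the web is big", "the crawl is slow", "the end"])
print(result["the"], result["is"])  # 3 2
```

A real cluster adds scheduling, data locality and failure handling on top of this, but the split, process and combine shape is the same.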

    How to Crawl

    A crawl consists of four jobs:

    1. Generate
    2. Fetch
    3. Parse
    4. Update Database

    *Generate

    Start with a list of the webpages you want to begin crawling from. This list is stored in the “webtable”.

    The Master node sends all of the pages in that list to its slaves (but if two pages have the same domain they are sent to the same slave).

    The Slave takes its assigned webpage(s) and:

    1. Checks whether the page has already been generated. If so, it skips it.
    2. Normalizes the URL, since “http://www.google.com/” and “http://www.google.com/../” are actually the same webpage.
    3. Returns the webpage along with an initial score to the Master.
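
To make the normalization step concrete, here is a minimal sketch using only Python's standard library. It is not Nutch's own normalizer (Nutch is written in Java and its rules are configurable). The function name `normalize_url` and the particular rules chosen (lower-casing, dropping default ports and fragments, collapsing `.` and `..`) are assumptions for illustration.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Return a canonical form so equivalent URLs compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    # Collapse "." and ".." segments; an empty path becomes "/".
    path = posixpath.normpath(parts.path) if parts.path else "/"
    # normpath drops a trailing slash, so put it back if it was there.
    if parts.path.endswith("/") and not path.endswith("/"):
        path += "/"
    # Drop the fragment: it never reaches the server.
    return urlunsplit((scheme, host, path, parts.query, ""))


print(normalize_url("http://www.google.com/../"))  # http://www.google.com/
print(normalize_url("http://www.google.com/"))     # http://www.google.com/
```

With this in place, both URLs from the example above map to the same webtable key.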

    (The Master partitions the webpages when it sends them to its slaves so that they all finish at about the same time.)

    The Master now chooses the topN highest-scoring pages (maybe the user only wanted to start with 10 pages) and marks them as chosen in the webtable.
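
The two ideas in the Generate step, keeping pages from one domain on one slave and picking the topN by score, can be sketched as follows. The helper names `partition_by_host` and `select_top_n`, and the use of a CRC32 hash to pick a worker, are assumptions for this example rather than Nutch's actual code.

```python
import heapq
import zlib
from collections import defaultdict
from urllib.parse import urlsplit


def partition_by_host(urls, num_workers):
    """Assign URLs to workers so that URLs sharing a host share a worker."""
    partitions = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname or ""
        # A stable hash of the host decides the worker, so the same host
        # always lands in the same partition.
        worker = zlib.crc32(host.encode("utf-8")) % num_workers
        partitions[worker].append(url)
    return partitions


def select_top_n(scored_urls, n):
    """Pick the n highest-scoring (url, score) pairs, best first."""
    return heapq.nlargest(n, scored_urls, key=lambda pair: pair[1])


parts = partition_by_host(
    ["http://a.com/1", "http://b.com/1", "http://a.com/2"], num_workers=4
)
chosen = select_top_n([("u1", 0.2), ("u2", 0.9), ("u3", 0.5)], n=2)
print(chosen)  # [('u2', 0.9), ('u3', 0.5)]
```

Hashing on the host rather than the full URL is what keeps a whole domain together.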

    *Fetch

    The Master looks at each URL in the webtable and maps the ones that were marked onto slaves for processing.

    The slaves fetch each URL from the Internet as fast as the connection will let them, keeping a separate queue for each domain.

    They return the URL along with the HTML text of the webpage to the Master.
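
A rough sketch of what one fetching slave does with its per-domain queues is shown below. It is deliberately simplified: it fetches one page at a time and takes turns between hosts, whereas a real fetcher runs many downloads concurrently and waits between requests to the same host. The names `fetch_all`, `download` and `fetch_page` are hypothetical.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit
from urllib.request import urlopen


def download(url, timeout=10):
    """Default downloader: return the page body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_all(urls, fetch_page=download):
    """Fetch URLs round-robin across hosts, one queue per host."""
    queues = defaultdict(deque)
    for url in urls:
        queues[urlsplit(url).hostname or ""].append(url)
    results = []
    while queues:
        # Take one URL from each host in turn so that no single host
        # receives all of its requests back to back.
        for host in list(queues):
            url = queues[host].popleft()
            try:
                results.append((url, fetch_page(url)))
            except OSError:
                results.append((url, None))  # record the failure, keep going
            if not queues[host]:
                del queues[host]
    return results


# Usage with a fake downloader, so the example runs without network access.
fake_pages = {"http://a.com/1": "A1", "http://a.com/2": "A2", "http://b.com/1": "B1"}
fetched = fetch_all(list(fake_pages), fetch_page=fake_pages.__getitem__)
print([url for url, _ in fetched])
# ['http://a.com/1', 'http://b.com/1', 'http://a.com/2']
```

The list of `(url, html)` pairs is what the slave would hand back to the Master.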

    *Parse

    The Master looks at each webpage in the webtable and, if it is marked as fetched, sends it to its slaves to be parsed.

    The slave first checks whether the page was already parsed by a different slave. If so, it skips it.

    Otherwise, it parses the webpage and saves the result to the webtable.
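
As an illustration of the Parse step, the sketch below pulls the outgoing links out of a fetched page with Python's built-in HTML parser. The row layout (a dict with `url`, `html`, `parsed` and `outlinks` keys) and the names `LinkExtractor` and `parse_page` are assumptions for the example. Nutch's real parsers are plugins and extract much more than links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the absolute targets of all <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def parse_page(row):
    """Parse one webtable row unless it has already been parsed."""
    if row.get("parsed"):
        return row  # another worker already handled this page
    extractor = LinkExtractor(row["url"])
    extractor.feed(row["html"])
    row["outlinks"] = extractor.links
    row["parsed"] = True
    return row


row = parse_page({
    "url": "http://example.com/dir/page.html",
    "html": '<a href="other.html">x</a> <a href="/top">y</a>',
})
print(row["outlinks"])
# ['http://example.com/dir/other.html', 'http://example.com/top']
```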

    *Update Database

    The Master looks at each webpage in the webtable and sends the parsed rows to its slaves.

    The slaves receive these parsed URLs, calculate a score for them based on the links leading away from those pages (and the text near those links), and send the URLs and scores back to the Master. Because of the partitioning, the results are already sorted by score when they reach the Master.

    The Master calculates and updates the webpage scores based on the number of links to those pages from other pages.

    The Master stores all of this in the database.
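
The "number of links to those pages from other ones" part of the score can be sketched as a simple inlink count. This is a stand-in, not Nutch's scoring: the function name `count_inlinks` and the rules used here (each linking page counts once, self-links are ignored) are assumptions, and the text-near-links signal is left out entirely.

```python
from collections import Counter


def count_inlinks(parsed_rows):
    """Count, for every URL, how many distinct other pages link to it."""
    inlinks = Counter()
    for row in parsed_rows:
        # A page that links to the same target twice still counts once.
        for target in set(row["outlinks"]):
            if target != row["url"]:  # ignore links from a page to itself
                inlinks[target] += 1
    return inlinks


rows = [
    {"url": "a", "outlinks": ["b", "c", "c"]},
    {"url": "b", "outlinks": ["c", "b"]},
    {"url": "c", "outlinks": []},
]
scores = count_inlinks(rows)
print(scores["c"], scores["b"], scores["a"])  # 2 1 0
```

Pages with more inlinks would then be preferred when the next Generate step picks its topN.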

    Repeat

    When the pages were parsed, the links out of those webpages were added to the webtable. You can now repeat the process on just the pages you haven’t looked at yet to keep expanding your set of visited pages. After enough iterations of the four steps above, you will eventually reach most of the pages that can be reached from your starting list.
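
The repeated generate, fetch, parse and update cycle can be sketched as one loop over an in-memory webtable. Everything here is a toy under stated assumptions: the webtable is a plain dict, `get_links` is a hypothetical callback that stands in for fetching and parsing a page, and scoring is skipped so the first `top_n` unfetched pages are taken.

```python
def crawl(seeds, get_links, rounds, top_n):
    """Run up to `rounds` crawl iterations starting from `seeds`."""
    webtable = {url: {"fetched": False, "outlinks": []} for url in seeds}
    for _ in range(rounds):
        # Generate: choose up to top_n pages that have not been fetched yet.
        batch = [url for url, row in webtable.items() if not row["fetched"]][:top_n]
        if not batch:
            break  # nothing new left to visit
        for url in batch:
            # Fetch + Parse: get_links stands in for both steps.
            links = get_links(url)
            webtable[url]["outlinks"] = links
            webtable[url]["fetched"] = True
            # Update: newly discovered pages join the webtable as unfetched rows.
            for link in links:
                webtable.setdefault(link, {"fetched": False, "outlinks": []})
    return webtable


# A four-page fake web: a -> b, c; b -> d; d -> a.
fake_web = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["a"]}
table = crawl(["a"], lambda url: fake_web.get(url, []), rounds=3, top_n=10)
print(sorted(table))  # ['a', 'b', 'c', 'd']
```

Each round only touches pages that earlier rounds discovered but did not visit, which is why the visited set keeps growing.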

    Conclusion

    MapReduce is a cool system.

    A lot of effort has gone into making it as efficient as possible.

    It can handle computers breaking down in the middle of a job by reassigning their work to other slaves. It can also handle some slaves being faster than others.

    The Master may decide to run a slave’s task on its own machine instead of sending it out, if that will be more efficient. The communication network is incredibly advanced.

    MapReduce lets you write simple code:

    Define a Mapper, an optional Partitioner, and a Reducer.

    Then let MapReduce figure out how best to run it with all the computing resources it has access to, whether that is a single computer with a slow internet connection, a kilo-cluster, or maybe even a mega-cluster.
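
To show what "define a Mapper, an optional Partitioner, and a Reducer" looks like, here is a small single-process sketch. It is not the Hadoop API (which is Java). `run_job` is a hypothetical stand-in for the framework, and the job itself, counting pages per host, is made up for the example.

```python
import zlib
from collections import defaultdict
from urllib.parse import urlsplit


def run_job(records, mapper, reducer, partitioner, num_reducers):
    """Toy framework: map every record, route keys to reducers, then reduce."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for record in records:
        for key, value in mapper(record):
            # The partitioner decides which reducer is responsible for a key.
            buckets[partitioner(key, num_reducers)][key].append(value)
    output = {}
    for bucket in buckets:  # on a cluster each bucket could be a separate machine
        for key in sorted(bucket):
            output[key] = reducer(key, bucket[key])
    return output


# The three pieces a user of the framework writes:
def host_mapper(url):
    yield urlsplit(url).hostname or "", 1


def hash_partitioner(key, num_reducers):
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def sum_reducer(key, values):
    return sum(values)


pages = ["http://a.com/1", "http://a.com/2", "http://b.com/1"]
per_host = run_job(pages, host_mapper, sum_reducer, hash_partitioner, num_reducers=2)
print(per_host["a.com"], per_host["b.com"])  # 2 1
```

The three small functions are all the job-specific code. How the work is spread over machines is left to the framework.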
