
The Archive Base


Editorial Team
Asked: June 7, 2026

So usually for 20 node cluster submitting job to process 3GB(200 splits) of data


Usually, on a 20-node cluster, submitting a job to process 3 GB (200 splits) of data takes about 30 seconds, and the actual execution takes about 1 minute.
I want to understand what the bottleneck in the job submission process is, and to understand the following quote:

Per-MapReduce overhead is significant: Starting/ending MapReduce job costs time

Some steps I'm aware of:
1. data splitting
2. jar file sharing


1 Answer

  1. Editorial Team
    Added an answer on June 7, 2026 at 3:55 am

    A few things about HDFS and M/R help explain this latency:

    1. HDFS stores your files as data chunks (blocks) distributed across multiple machines called datanodes.
    2. M/R runs multiple programs called mappers, one on each of the data chunks or blocks. The (key, value) outputs of these mappers are combined into the result by reducers. (Think of summing the results from multiple mappers.)
    3. Each mapper and reducer is a full-fledged program that is spawned on this distributed system. Spawning a full-fledged program takes time, even if it does nothing (a no-op MapReduce program).
    4. When the data to be processed becomes very large, these spawn times become insignificant, and that is when Hadoop shines.

    If you were to process a file with 1,000 lines of content, you would be better off using a normal program that reads and processes the file. The Hadoop infrastructure for spawning processes on a distributed system will not yield any benefit here; it will only add the overhead of locating the datanodes that contain the relevant data chunks, starting the processing programs on them, and tracking and collecting the results.

    Now expand that to hundreds of petabytes of data, and these overheads look completely insignificant compared to the time it would take to process the data. Parallelization of the processors (mappers and reducers) will show its advantage here.
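
The trade-off described above can be put into rough numbers. The sketch below is only an illustrative model, not a measurement. It takes the figures from the question (about 30 seconds of fixed submission overhead, and about 60 seconds to process 3 GB) and assumes, purely for illustration, that processing time grows linearly with data size while the overhead stays constant. The function name and its parameters are hypothetical.

```python
def overhead_fraction(data_gb: float,
                      fixed_overhead_s: float = 30.0,
                      throughput_gb_per_s: float = 3.0 / 60.0) -> float:
    """Return the share of total wall-clock time spent on fixed job overhead.

    Assumptions (not from any Hadoop API): the overhead is constant per job,
    and processing time scales linearly with the amount of data.
    """
    processing_s = data_gb / throughput_gb_per_s
    return fixed_overhead_s / (fixed_overhead_s + processing_s)


# Compare a tiny input, the 3 GB job from the question, and much larger inputs.
for size_gb in (0.003, 3, 3_000, 3_000_000):
    print(f"{size_gb:>12} GB -> {overhead_fraction(size_gb):.4%} overhead")
```

Under these assumptions the overhead is about a third of the total time for the 3 GB job, nearly all of it for a few megabytes, and a negligible share once the input reaches terabytes.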

    So before analyzing the performance of your M/R job, you should first benchmark your cluster so that you understand the overheads better.

    How much time does it take to run a no-operation MapReduce program on the cluster?

    Use MRBench for this purpose:

    1. MRBench loops a small job a number of times.
    2. It checks whether small job runs are responsive and running efficiently on your cluster.
    3. Its impact on the HDFS layer is very limited.

    To run this program, try the following (check the correct approach for the latest versions):

    hadoop jar /usr/lib/hadoop-0.20/hadoop-test.jar mrbench -numRuns 50
    

    Surprisingly, on one of our dev clusters it was 22 seconds.

    Another issue is file size.

    If the files are smaller than the HDFS block size, then Map/Reduce programs have significant overhead. Hadoop will typically try to spawn one mapper per block. That means if you have 30 files of 5 KB each, Hadoop may end up spawning 30 mappers, one per file, even though each file is far smaller than a block. This is wasteful, because the overhead of each program is significant compared to the time it spends processing such a small file.
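
The small-files effect can be sketched with a simple count. This is a simplified model, not Hadoop's actual split logic: it assumes one mapper per block, at least one mapper per file, a block size of 128 MB (an assumed value; check your cluster's configuration), and it ignores split-size settings and input formats that pack several files into one split. The names `BLOCK_SIZE` and `estimate_mappers` are hypothetical.

```python
import math

# Assumed HDFS block size in bytes; the real value depends on cluster config.
BLOCK_SIZE = 128 * 1024 * 1024


def estimate_mappers(file_sizes, block_size=BLOCK_SIZE):
    """Estimate the mapper count as one mapper per block of each file.

    Simplified model: expects positive file sizes in bytes, so every file
    contributes at least one mapper.
    """
    return sum(math.ceil(size / block_size) for size in file_sizes)


small_files = [5 * 1024] * 30                # 30 files of 5 KB each
print(estimate_mappers(small_files))         # 30 mappers, one per tiny file
print(estimate_mappers([sum(small_files)]))  # 1 mapper if merged into one file
```

In this model, the same 150 KB of data costs 30 mapper start-ups when spread over 30 files, but only one when stored as a single file.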

© 2021 The Archive Base. All Rights Reserved