The Archive Base Latest Questions

Editorial Team
Asked: June 16, 2026

I am new here, but need to know the best way to do unit


I am new here and need to know the best way to unit test programs written on Apache Hadoop. I know we can write JUnit-style test cases for the logic inside the map and reduce methods, and do the same for any other logic involved, but that doesn’t guarantee the job is well tested or that it will work in the actual runtime environment.

I have read about MRUnit, but it seems to be a more mature version of the same approach: it doesn’t run a real MapReduce job, only a mocked one.

Any help would be appreciated.

Thanks.


1 Answer

  1. Editorial Team
    Added an answer on June 16, 2026 at 8:40 am

    You certainly have other options. A little googling would have turned them up, but I did that for you.

    Here is the text I’m pasting from: http://blog.cloudera.com/blog/2009/07/advice-on-qa-testing-your-mapreduce-jobs/

    Other than using traditional JUnit and MRUnit, you have the following options:

    Local Job Runner Testing – Running MR Jobs on a Single Machine in a Single JVM

    Traditional unit tests and MRUnit should do a fairly sufficient job detecting bugs early, but neither will test your MR jobs with Hadoop. The local job runner lets you run Hadoop on a local machine, in one JVM, making MR jobs a little easier to debug in the case of a job failing.

    To enable the local job runner, set “mapred.job.tracker” to “local” and “fs.default.name” to “file:///some/local/path” (these are the default values).

    Remember, there is no need to start any Hadoop daemons when using the local job runner. Running bin/hadoop will start a JVM and will run your job for you. Creating a new hadoop-local.xml file (or mapred-local.xml and hdfs-local.xml if you’re using 0.20) probably makes sense. You can then use the --config parameter to tell bin/hadoop which configuration directory to use. If you’d rather avoid fiddling with configuration files, you can create a class that implements Tool and uses ToolRunner, and then run this class with bin/hadoop jar foo.jar com.example.Bar -D mapred.job.tracker=local -D fs.default.name=file:/// (args), where Bar is the Tool implementation.
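If the job is launched from a test harness rather than typed by hand, the command line quoted above can be assembled programmatically. The following is a minimal sketch using only the JDK; the class name `LocalJobCommand` and its `build` method are hypothetical helpers, while the jar name, Tool class and `-D` properties are the ones from the command in the text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Builds the bin/hadoop command line for a local job runner run (illustrative helper). */
final class LocalJobCommand {

    // Property settings quoted in the text above.
    static final String JOB_TRACKER = "mapred.job.tracker=local";
    static final String DEFAULT_FS = "fs.default.name=file:///";

    /**
     * @param jar       path to the job jar, e.g. "foo.jar"
     * @param toolClass fully qualified Tool implementation, e.g. "com.example.Bar"
     * @param jobArgs   arguments passed through to the job itself
     */
    static List<String> build(String jar, String toolClass, String... jobArgs) {
        List<String> cmd = new ArrayList<>();
        cmd.addAll(Arrays.asList("bin/hadoop", "jar", jar, toolClass));
        // The -D options go before the job's own arguments, as in the quoted command.
        cmd.addAll(Arrays.asList("-D", JOB_TRACKER, "-D", DEFAULT_FS));
        cmd.addAll(Arrays.asList(jobArgs));
        return cmd;
    }

    public static void main(String[] args) {
        // Prints the assembled command; "in" and "out" are placeholder job arguments.
        System.out.println(String.join(" ", build("foo.jar", "com.example.Bar", "in", "out")));
    }
}
```

The resulting list can be handed to `new ProcessBuilder(cmd)` from the Hadoop installation directory, which keeps the local-runner settings in one place instead of scattering them across scripts.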

    To start using the local job runner to test your MR jobs in Hadoop, create a new configuration directory that is local job runner enabled and invoke your job as you normally would, remembering to include the --config parameter, which points to a directory containing your local configuration files.

    The -conf parameter also works in 0.18.3 and lets you specify your hadoop-local.xml file instead of specifying a directory with --config. Hadoop will run the job happily. The difficulty with this form of testing is verifying that the job ran correctly. Note: you’ll have to ensure that input files are set up correctly and output directories don’t exist before running the job.
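The precondition step in the note above (input in place, no leftover output directory) is easy to automate when the local job runner reads and writes the local file system. Here is a rough sketch with plain `java.nio`; `JobPreconditions`, `clearOutputDir` and `stageInput` are hypothetical names, not part of Hadoop.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Stream;

/** Prepares local input and output directories before a test run (illustrative helper). */
final class JobPreconditions {

    /** Deletes the output directory and everything under it, if it exists. */
    static void clearOutputDir(Path outputDir) throws IOException {
        if (!Files.exists(outputDir)) {
            return; // nothing left over from a previous run
        }
        try (Stream<Path> paths = Files.walk(outputDir)) {
            // Reverse order so children are deleted before their parent directories.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    /** Writes the given lines to inputDir/fileName, creating the directory if needed. */
    static Path stageInput(Path inputDir, String fileName, List<String> lines) throws IOException {
        Files.createDirectories(inputDir);
        return Files.write(inputDir.resolve(fileName), lines);
    }
}
```

Calling `clearOutputDir` before every run keeps a stale output directory from failing the job before it starts, and `stageInput` gives each test a known, small input file.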

    Assuming you’ve managed to configure the local job runner and get a job running, you’ll have to verify that your job completed correctly. Simply basing success on exit codes isn’t quite good enough. At the very least, you’ll want to verify that the output of your job is correct. You may also want to scan the output of bin/hadoop for exceptions. You should create a script or unit test that sets up preconditions, runs the job, diffs actual output and expected output, and scans for raised exceptions. This script or unit test can then exit with the appropriate status and output specific messages explaining how the job failed.
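The checks described above (compare actual with expected output, scan for exceptions, turn the result into an exit status) could look roughly like this. It is a sketch under assumptions: `JobOutputVerifier` and its methods are hypothetical, the job output and the captured bin/hadoop output are assumed to have been read into lists of lines already, and matching on the strings "Exception" and "Error" is a crude heuristic rather than anything Hadoop defines.

```java
import java.util.ArrayList;
import java.util.List;

/** Compares job output with expectations and scans logs for failures (illustrative helper). */
final class JobOutputVerifier {

    /** Returns one message per differing line; an empty list means the outputs match. */
    static List<String> diff(List<String> expected, List<String> actual) {
        List<String> problems = new ArrayList<>();
        int max = Math.max(expected.size(), actual.size());
        for (int i = 0; i < max; i++) {
            String exp = i < expected.size() ? expected.get(i) : "<missing>";
            String act = i < actual.size() ? actual.get(i) : "<missing>";
            if (!exp.equals(act)) {
                problems.add("line " + (i + 1) + ": expected [" + exp + "] but got [" + act + "]");
            }
        }
        return problems;
    }

    /** Returns the log lines that mention an exception or error (simple substring match). */
    static List<String> findExceptions(List<String> logLines) {
        List<String> hits = new ArrayList<>();
        for (String line : logLines) {
            if (line.contains("Exception") || line.contains("Error")) {
                hits.add(line);
            }
        }
        return hits;
    }

    /** Exit status for the wrapping script or test: 0 on success, 1 on any failure. */
    static int status(int jobExitCode, List<String> diffs, List<String> exceptions) {
        return (jobExitCode == 0 && diffs.isEmpty() && exceptions.isEmpty()) ? 0 : 1;
    }
}
```

Printing the `diff` and `findExceptions` results before exiting with `status(...)` gives the specific failure messages the text asks for, rather than a bare non-zero exit code.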

    Note that the local job runner has a couple of limitations: only one reducer is supported, and the DistributedCache doesn’t work (a fix is in progress).

    Pseudo-distributed Testing – Running MR Jobs on a Single Machine Using Daemons

    The local job runner lets you run your job in a single thread. Running an MR job in a single thread is useful for debugging, but it doesn’t properly simulate a real cluster with several Hadoop daemons running (e.g., NameNode, DataNode, TaskTracker, JobTracker, SecondaryNameNode). A pseudo-distributed cluster is composed of a single machine running all Hadoop daemons. This cluster is still relatively easy to manage (though harder than local job runner) and tests integration with Hadoop better than the local job runner does.

    To start using a pseudo-distributed cluster to test your MR jobs in Hadoop, follow the aforementioned advice for using the local job runner, but in your precondition setup include the configuration and start-up of all Hadoop daemons. Then, to start your job, just use bin/hadoop as you would normally.

    Full Integration Testing – Running MR Jobs on a QA Cluster

    Probably the most thorough yet most cumbersome mechanism for testing your MR jobs is to run them on a QA cluster composed of at least a few machines. By running your MR jobs on a QA cluster, you’ll be testing all aspects of both your job and its integration with Hadoop.

    Running your jobs on a QA cluster has many of the same issues as the local job runner. Namely, you’ll have to check the output of your job for correctness. You may also want to scan the stdout and stderr produced by each task attempt, which will require collecting these logs to a central place and grepping them. Scribe is a useful tool for collecting logs, though it may be superfluous depending on your QA cluster.
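Once the task-attempt logs have been collected into one directory, the grepping step can be done from the same test harness. The sketch below assumes the logs are plain UTF-8 text files somewhere under a single root directory; `TaskLogGrep` is a hypothetical helper and the directory layout is not something Hadoop or Scribe prescribes.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Searches collected task logs for lines matching a pattern (illustrative helper). */
final class TaskLogGrep {

    /** Maps each log file under logRoot to its matching lines; files without hits are omitted. */
    static Map<Path, List<String>> grep(Path logRoot, Pattern pattern) throws IOException {
        List<Path> files;
        try (Stream<Path> paths = Files.walk(logRoot)) {
            files = paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        Map<Path, List<String>> hits = new TreeMap<>();
        for (Path file : files) {
            // Assumes UTF-8 text; readAllLines throws on malformed input.
            List<String> matching = Files.readAllLines(file).stream()
                    .filter(line -> pattern.matcher(line).find())
                    .collect(Collectors.toList());
            if (!matching.isEmpty()) {
                hits.put(file, matching);
            }
        }
        return hits;
    }
}
```

For example, `TaskLogGrep.grep(root, Pattern.compile("Exception"))` returns an empty map when no collected log mentions an exception, which a QA script can treat as one of its pass conditions.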

    We find that most of our customers have some sort of QA or development cluster where they can deploy and test new jobs, try out newer versions of Hadoop, and practice upgrading clusters from one version of Hadoop to another. If Hadoop is a major part of your production pipeline, then creating a QA or development cluster makes a lot of sense, and repeatedly running jobs on it will ensure that changes to your jobs continue to get tested thoroughly. EC2 may be a good host for your QA cluster, as you can bring it up and down on demand. Take a look at our beta EC2 EBS Hadoop scripts if you’re interested in creating a QA cluster in EC2.

    You should choose QA practices based on the importance of QA for your organization and also on the amount of resources you have. Simply using a traditional unit-testing framework, MRUnit and the local job runner can test your MR jobs thoroughly in a simple way without using too many resources. However, running your jobs on a QA or development cluster is naturally the most complete way to test your MR jobs, at the cost of the expense and operational work of running a Hadoop cluster.
