What are my options for developing Java MapReduce jobs in Eclipse? My final goal is to run my map/reduce logic on my Amazon Hadoop cluster, but I would like to test the logic on my local machine first and set breakpoints in it before deploying it to a larger cluster.
I see there is a Hadoop plug-in for Eclipse, which looks old (correct me if I am wrong), and a company called Karmasphere had something for Eclipse and Hadoop, but I am not sure if that is still available.
How do you go about developing, testing and debugging your map/reduce job with Eclipse?
I develop Cassandra/Hadoop applications in Eclipse by:
Using Maven (m2e) to gather and configure the dependencies (Hadoop, Cassandra, Pig, etc.) for my Eclipse projects
Creating test cases (classes in src/test/java) to test my mappers and reducers. The trick is to build a context object on the fly using inner classes that extend RecordWriter and StatusReporter. If you do this then after you invoke setup/map/cleanup or setup/reduce/cleanup you can assert the correct key/value pairs and context info were written by the mapper or reducer. The constructors for contexts in both mapred and mapreduce look ugly, but you’ll find the classes are pretty easy to instantiate.
Once you write these tests, Maven will invoke them automatically every time you build.
You can invoke the tests manually by selecting the project and doing a Run –> Maven Test. This turns out to be really handy because the tests are invoked in debug mode and you can set breakpoints in your mappers and reducers and do all the cool things Eclipse lets you do in debug.
Once you’re happy with the quality of your code, use Maven to build a jar-with-dependencies, the all-in-one jar that Hadoop likes so much.
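To make the capture-and-assert idea from the test-case step more concrete, here is a minimal sketch. It deliberately uses stand-in types (the hypothetical `Emitter`, `RecordingEmitter` and `map` below are my own names, not Hadoop classes) so that it compiles without the Hadoop jars. In a real test the recording class would instead extend Hadoop's `RecordWriter`, and you would hand it to the context you build for your mapper or reducer.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class RecordingWriterSketch {

    // Stand-in for the "write" side of a Hadoop context (hypothetical, not a Hadoop type).
    interface Emitter<K, V> {
        void write(K key, V value);
    }

    // Test double: remembers every key/value pair so a test can assert on them later.
    static class RecordingEmitter<K, V> implements Emitter<K, V> {
        final List<Map.Entry<K, V>> written = new ArrayList<>();

        @Override
        public void write(K key, V value) {
            written.add(new SimpleEntry<>(key, value));
        }
    }

    // Hypothetical mapper logic: emits (lower-cased word, 1) for each whitespace-separated token.
    static void map(String line, Emitter<String, Integer> out) {
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.write(token.toLowerCase(), 1);
            }
        }
    }

    public static void main(String[] args) {
        RecordingEmitter<String, Integer> out = new RecordingEmitter<>();
        map("Hadoop in  Eclipse", out);

        // Print what the mapper logic wrote; a unit test would assert on out.written instead.
        for (Map.Entry<String, Integer> pair : out.written) {
            System.out.println(pair.getKey() + "\t" + pair.getValue());
        }
    }
}
```

Because the mapper logic only sees the writer it is given, you can set a breakpoint inside `map` and step through it from a test, which is the same workflow I use with the real context objects.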
Just as a side note, I’ve built a number of code generation tools based on the M2T JET project in Eclipse. They generate out the infrastructure for everything I’ve mentioned above and I just write the logic for my mappers, reducers and test cases. I think if you gave it some thought you could probably come up with a set of reusable classes that you could extend to do pretty much the same thing.
Here’s a sample test case class:
And here’s a sample Maven POM. Note that the referenced versions are a bit out of date, but as long as those versions are kept in a Maven repository somewhere, you’ll be able to build this project.