I’m trying to use the CombineFileInputFormat class with Yelp’s MrJob tool for EMR. The jobflow is created using Hadoop streaming, and MrJob’s documentation indicates that the CombineFileInputFormat class must be bundled in a customized hadoop-streaming.jar.
For context, please follow this question.
Specifically, my question is: where should the concrete class CombinedInputFormat.class be bundled or referenced within the hadoop-streaming.jar?
I have tried bundling the CombinedInputFormat.class by adding it to a directory org/apache/hadoop/streaming and executing:
jar uvf my-hadoop-streaming.jar org/apache/hadoop/streaming
If I do that and start the streaming jobflow with the option -inputformat CombinedInputFormat, the job starts the first step and then breaks with this error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/streaming/CombinedInputFormat (wrong name: CombinedInputFormat)
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
...
If I instead add it at the root of the jar with:
jar uvf my-hadoop-streaming.jar CombinedInputFormat.class
The error I get is:
-inputformat : class not found : CombinedInputFormat
Streaming Job Failed!
How should I bundle CombinedInputFormat.class so that it is picked up correctly and the NoClassDefFoundError goes away?
The class CombinedInputFormat explained here extends CombineFileInputFormat and isn’t shipped with Hadoop. So what you need to do is create a class, in the same package as your mapper/reducer job class, containing the code given in the earlier question. Then build the jar and it should run normally.
So basically, you need to write your own implementation of CombineFileInputFormat (which I did for you), and you can name it anything you want, say ABCClass instead of CombinedInputFormat as I had named it.
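As for the "wrong name" message in the question: it suggests the class file was compiled without a package statement but stored under org/apache/hadoop/streaming in the jar. The path of a class inside a jar has to match the package declared in its source. Here is a minimal sketch of that name-to-path rule using only the JDK; JarEntryPath, entryFor and jarContains are hypothetical helper names chosen for illustration, not part of Hadoop or MrJob.

```java
import java.io.IOException;
import java.util.jar.JarFile;

// Hypothetical helper for illustration only; not part of Hadoop or MrJob.
final class JarEntryPath {

    // Maps a fully qualified class name to the jar entry a class loader looks up.
    static String entryFor(String className) {
        return className.replace('.', '/') + ".class";
    }

    // Returns true if the jar holds the class at the path its name requires.
    static boolean jarContains(String jarPath, String className) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getEntry(entryFor(className)) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        // Compiled without a package statement: the entry must sit at the jar root.
        System.out.println(entryFor("CombinedInputFormat"));
        // Compiled with "package org.apache.hadoop.streaming;": the entry must sit in that directory.
        System.out.println(entryFor("org.apache.hadoop.streaming.CombinedInputFormat"));
        // Optional check against a real jar: java JarEntryPath.java <jar> <class name>
        if (args.length == 2) {
            System.out.println(jarContains(args[0], args[1]));
        }
    }
}
```

So if you keep the class under org/apache/hadoop/streaming, its source should declare that package, and the name you pass to -inputformat should then be the fully qualified one. This only covers the name-to-path rule; it does not check whether the streaming job's class loader actually sees the jar.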