I have a MapReduce job whose role is to split my input file into two files according to a given criterion.
I am currently using Hadoop r0.20.203 because it is the current stable version.
This version offers two APIs:
- The old/deprecated one (org.apache.hadoop.mapred)
- The new one (org.apache.hadoop.mapreduce)
As you can imagine, I am using the new API, and my problem is that Hadoop r0.20.203 does not offer any multiple-output format in the new API.
Hadoop 0.20.203 still offers MultipleTextOutputFormat and MultipleOutputs (both of which are suitable for my case) in the old API. Moreover, the newer Hadoop 0.22 offers MultipleOutputs in the new API.
I see four solutions to my problem:
- Switch to Hadoop 0.22. The problem with this solution is that this version may not be deployed on the clusters I’m using because of its beta status.
- Use the old API for this specific job and the new one for the others. I have seen that the old API has been undeprecated in Hadoop 1.0.0, so can it still be used? If I need to switch to a newer Hadoop version later, I would have only this job to rewrite.
- Use the old API for all my jobs to avoid compatibility/consistency problems. Do you think it could harm the evolution of my program, especially if I need to switch to a newer Hadoop version later?
- Forget about multiple outputs and find another solution.
What would you do if you were me?
Why don’t you copy the MultipleOutputs source code into your project and use it?
http://grepcode.com/file_/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-737/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.java/?v=source
It should be compatible with r0.20.203; I don’t see it using any classes that would be unavailable in the older version.
And there is really nothing magic about it: it just sets up a record writer for each configured output (type and so on). I bet you could have written your own in the time it took to formulate the question.
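To illustrate the "one record writer per configured output" idea, here is a minimal plain-Java sketch. It is not the Hadoop MultipleOutputs class and uses no Hadoop API: the class name NamedOutputRouter, its methods, and the output names "matching" and "other" are all hypothetical, and java.io.Writer stands in for Hadoop's record writers.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration (not a Hadoop class): one writer per named output. */
class NamedOutputRouter implements AutoCloseable {
    // Maps each configured output name to the writer that receives its records.
    private final Map<String, Writer> writers = new LinkedHashMap<>();

    /** Registers a writer under a name, e.g. "matching" or "other". */
    void addNamedOutput(String name, Writer writer) {
        writers.put(name, writer);
    }

    /** Writes one record, followed by a newline, to the named output. */
    void write(String name, String record) throws IOException {
        Writer w = writers.get(name);
        if (w == null) {
            throw new IllegalArgumentException("Unknown output: " + name);
        }
        w.write(record);
        w.write('\n');
    }

    /** Closes every registered writer. */
    @Override
    public void close() throws IOException {
        for (Writer w : writers.values()) {
            w.close();
        }
    }

    /** Small demo: split records into two outputs according to a criterion. */
    public static void main(String[] args) throws IOException {
        StringWriter matching = new StringWriter();
        StringWriter other = new StringWriter();
        try (NamedOutputRouter router = new NamedOutputRouter()) {
            router.addNamedOutput("matching", matching);
            router.addNamedOutput("other", other);
            for (String line : new String[] {"x1", "y2", "x3"}) {
                // The criterion here (prefix "x") is only an example.
                router.write(line.startsWith("x") ? "matching" : "other", line);
            }
        }
        System.out.print("matching:\n" + matching + "other:\n" + other);
    }
}
```

In a real job the writers would be created from the configured output format rather than passed in by hand, but the routing logic would presumably look much the same.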