I was working with ArrayWritable, at some point I needed to check how Hadoop serializes the ArrayWritable, this is what I got by setting job.setNumReduceTasks(0):
0 IntArrayWritable@10f11b8
3 IntArrayWritable@544ec1
6 IntArrayWritable@fe748f
8 IntArrayWritable@1968e23
11 IntArrayWritable@14da8f4
14 IntArrayWritable@18f6235
and this is the test mapper that I was using:
public static class MyMapper extends Mapper<LongWritable, Text, LongWritable, IntArrayWritable> {
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
int red = Integer.parseInt(value.toString());
IntWritable[] a = new IntWritable[100];
for (int i =0;i<a.length;i++){
a[i] = new IntWritable(red+i);
}
IntArrayWritable aw = new IntArrayWritable();
aw.set(a);
context.write(key, aw);
}
}
IntArrayWritable is taken from the example given in the javadoc: ArrayWritable.
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.IntWritable;
public class IntArrayWritable extends ArrayWritable {
public IntArrayWritable() {
super(IntWritable.class);
}
}
I actually checked on the source code of Hadoop and this makes no sense to me.
ArrayWritable should not serialize the class name and there is no way that an array of 100 IntWritable can be serialized using 6/7 hexadecimal values. The application actually seems to work just fine and the reducer deserializes the right values…
What is happening? What am I missing?
The problem is that the output you are getting from your MapReduce job is not the serialized version of that data. It is something that is translated into a pretty printed string.
When you set the number of reducers to zero, your mappers now get passed through a output format, which will format your data, likely converting it to a readable string. It does not dump it out serialized as if it was going to be picked up by a reducer.