Good day…
I am a bit confused; what is the difference between a reduce task and a reduce job?
Here is my case: I have read that reduce does not start until all mapping is finished,
but in the Hadoop output I see otherwise:
12/02/11 10:58:50 INFO mapred.JobClient: map 60% reduce 16%
12/02/11 10:58:54 INFO mapred.JobClient: map 60% reduce 20%
12/02/11 10:58:55 INFO mapred.JobClient: map 65% reduce 20%
The reduce is at 16% whilst the map is still at 60%…
What is really happening here?
There are three phases within the “reduce phase”: shuffle, sort, and reduce. The shuffle copies the map output over to the reducers, and the sort groups the keys together. The reduce is the actual reduce function that you wrote.
The way the percentages work is that shuffle counts for 33%, sort for 33%, and reduce for 33%. What you are seeing at “reduce 16%” is that about 16%/33% (i.e., roughly 48%) of the data has been copied over to the reducers, which can happen while the mappers are still running. The final 33% of “reduce”, where your reduce function actually runs, can’t start until all the mappers are done.
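
To make the arithmetic concrete, here is a small sketch in plain Python. It is not Hadoop code and does not call any Hadoop API; the function names are made up for illustration, and it simply assumes the even three-way split described above.

```python
# Illustrative only: models the reported "reduce N%" figure as the average
# of three equally weighted phases (shuffle, sort, reduce), per the answer above.
# Function names are hypothetical, not part of Hadoop.

PHASE_WEIGHT = 100.0 / 3  # each phase contributes about 33.3 percentage points


def reported_reduce_percent(shuffle_done, sort_done, reduce_done):
    """Combine per-phase completion fractions (0.0 to 1.0) into one percentage."""
    return (shuffle_done + sort_done + reduce_done) * PHASE_WEIGHT


def shuffle_fraction(reported_percent):
    """Estimate how much of the shuffle is done from a reported reduce percentage.

    Only meaningful while the job is still in the shuffle phase,
    so the result is capped at 1.0.
    """
    return min(reported_percent / PHASE_WEIGHT, 1.0)


if __name__ == "__main__":
    # "reduce 16%" from the log: roughly half of the copying is done.
    print(f"shuffle done: {shuffle_fraction(16):.0%}")
    # Shuffle and sort complete, reduce function not yet started.
    print(f"reported: {reported_reduce_percent(1.0, 1.0, 0.0):.1f}%")
```

Under this model, anything up to about 33% in the reduce column is just copying, so seeing it climb while the map column is below 100% is expected.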