Why is Reduce required in MapReduce? If a job such as counting the number of words in a book will result in the same outcome if performed by a single process or MapReduced over a farm of servers, what possibility exists that duplicates will have to be removed? I’m assuming the Reduce step, at least in this example, would simply SUM the outcome from each worker process and deliver the total count of words in the book. I don’t understand where duplicates of anything come into the picture.
Share
The Reduce step is not meant for removing duplicates (although that is a possible use case in some situations). Reduce is meant for aggregation of outputs from various mappers with the same key.
For example in the word count example, Node 1 might get 10 instances of a word, say “school” and Node 2 might have 15 instances, Node 3 12 instances. Now how will the sum be calculated? The results 10, 15 and 12 are on different nodes. There is a shuffle phase which brings all these values to one node (which is the reducer allotted to the key “school” by the partitioner). Then the reducer will have all the values for this key, and it can sum them up.
EDIT: As Tudor mentioned, by aggregation I mean aggregation in the more general sense of “bringing together”
EDIT2: To clarify RaffiM’s doubt:
Continuing the above example, let’s say Node 1 had pages 1-10, Node 2 had pages 11-20 and Node 3 got pages 21-30. So, after the mao phase, we know that Pages 1-10 have the word “school” 10 times, pages 11-20 have that word appear 15 times and likewise, 15 times for pages 21-30. Now what we need is the total number of times the word appears in the whole book, so we need to still add these up. We need 10+15+12+the numbers for the other page ranges…
If you don’t use the combiner, the mapper just sends “1” for each time the word appears. So for pages 1-10, it will send <“school”,1> as the output key-value 10 times. To make it more efficient, we use the combiner which sums it up at the mapper level. So if you use the combiner, it will consume this in Node 1 itself and generate a consolidated output <“school”, 10> for Node 1.