I am new to hadoop here. It is not clear why we need to

Question

Asked: 2026-06-08T20:44:49+00:00

I am new to Hadoop. It is not clear to me why Hadoop MapReduce needs to sort by key. After the map phase, the data for each unique key has to be distributed to some number of reducers. Surely this can be done without sorting?

1 Answer

Editorial Team · Answer 1 · 2026-06-08T20:44:50+00:00

Sorting is there because it is a neat trick for grouping keys. Of course, if your job or algorithm does not need the keys in any particular order, grouping them with a hashing trick will be faster.
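To make the two grouping strategies concrete, here is a minimal sketch in plain Python. It is not Hadoop code, and the function names `group_by_sort` and `group_by_hash` are made up for illustration. Both collect the values for each key; only the sort-based version also yields the keys in increasing order.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def group_by_sort(pairs):
    """Sort by key, then collect runs of adjacent equal keys."""
    ordered = sorted(pairs, key=itemgetter(0))  # stable sort on the key only
    return [(key, [value for _, value in run])
            for key, run in groupby(ordered, key=itemgetter(0))]


def group_by_hash(pairs):
    """Bucket values in a hash table; keys come out in first-seen order."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[key].append(value)
    return list(buckets.items())


# Hypothetical map output for a single partition.
pairs = [("b", 1), ("a", 2), ("b", 3), ("c", 4), ("a", 5)]
by_sort = group_by_sort(pairs)
by_hash = group_by_hash(pairs)
```

Both results contain the same groups, so a reducer that does not care about key order could work with either one. The hash version skips the O(n log n) sort, but it gives up the ordering guarantee.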

In Hadoop itself, a JIRA issue for this has been open for years (source).
Several distributions that layer on top of Hadoop already have this feature, for example Hanborq, which calls it sort avoidance (source).

As for your actual question (why): MapReduce originated in a paper from Google (source), which states the following:

We guarantee that within a given partition, the intermediate key/value
pairs are processed in increasing key order. This ordering guarantee
makes it easy to generate a sorted output file per partition, which is
useful when the output file format needs to support efficient random
access lookups by key, or users of the output find it convenient to
have the data sorted.

So supporting sort was more of a convenience decision; it was not meant to make sorting the only way to group keys.
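The "efficient random access lookups by key" mentioned in the quoted passage can be illustrated with a small sketch. It assumes one partition's output is held as two parallel in-memory lists already in increasing key order; the sample data and the `lookup` helper are hypothetical, and a real output file format would be more involved.

```python
from bisect import bisect_left

# Hypothetical reducer output for one partition, in increasing key order.
keys = ["apple", "banana", "cherry", "mango"]
values = [3, 7, 1, 4]


def lookup(key):
    """Binary-search the sorted keys; return the value, or None if absent."""
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None
```

Because the keys are sorted, each lookup takes O(log n) comparisons instead of a full scan, which is the kind of convenience the ordering guarantee buys.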
