I am trying to understand why MapReduce performs an implicit sort during the shuffle and sort phase, on both the map side and the reduce side. The sort is a mixture of in-memory and on-disk sorting, which can be really expensive for large data sets.
My concern is that performance is a significant consideration when running MapReduce jobs, and implicitly sorting the mapper output by key before handing it to the reducer has a considerable performance cost on large data sets.
I understand that sorting is a boon in cases where sorted output is explicitly required, but that is not always the case. So why does implicit sorting exist in Hadoop MapReduce at all?
For reference on what I mean by the shuffle and sort phase, see the post "Map-Reduce: Shuffle and Sort" on my blog, Hadoop-Some Salient Understandings.
One possible explanation, which came to mind some time after I posted this question, is:
The sorting is done simply to bring together all the records that share a key. The default partitioning logic in Hadoop MapReduce already sends every record with a given key to the same reducer; sorting by key then makes those records arrive as one contiguous group, so the reducer can process a key completely before moving on to the next one. In other words, sorting the mapper output is mainly a grouping mechanism, and the fact that the keys also come out in sorted order is a side effect that only some use cases rely on, such as sorting large data sets.
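To make the explanation above concrete, here is a minimal Python sketch of sort-based grouping. It is a toy model of the idea, not Hadoop's actual implementation: the names `sorted_run` and `reduce_side` are made up for illustration, each sorted list stands in for one map-side spill, and the reducer is just `sum`.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def sorted_run(records):
    """Sort one batch of (key, value) pairs by key (stands in for one map-side spill)."""
    return sorted(records, key=itemgetter(0))


def reduce_side(runs, reducer):
    """Merge already-sorted runs and call the reducer once per key.

    Because every run is sorted, the merge is a streaming operation and
    all values for a key arrive back to back, so only one key's group
    needs to be looked at at any time.
    """
    merged = heapq.merge(*runs, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, reducer(value for _, value in group)


# Two hypothetical map outputs destined for the same reducer.
run_a = sorted_run([("b", 1), ("a", 2), ("c", 5)])
run_b = sorted_run([("c", 1), ("a", 3)])

result = list(reduce_side([run_a, run_b], sum))
print(result)
```

Without the sort, the reduce side would need some other way to collect every value for a key, for example a hash table holding all keys at once, before it could call the reducer for any of them.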
If anyone can confirm or correct this explanation, that would be great. Thanks.