I want to sort a big dataset efficiently (i.e. with a custom partitioner, like

Asked: May 23, 2026 (2026-05-23T21:27:57+00:00)

I want to sort a big dataset efficiently (i.e. with a custom partitioner, like described here: How does the MapReduce sort algorithm work?), but I want to do it with Hive.

However, the Hive manual states that “order by” is performed by a single reducer.
This surprises me, as Pig does implement something similar to the article – pig impl

Am I missing something, or is it that Hive simply isn’t the right hammer for this job?

1 Answer

Editorial Team · Answer 1 · 2026-05-23T21:27:58+00:00

I think Hive is not the right tool for this job, at least for now. It is built to be used as an OLAP/reporting tool and is therefore not optimized to produce large result datasets, since most analytical queries produce relatively small result sets. As a result, it has good Top N capability but not a good total-order sort.
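To illustrate why Top N is so much cheaper than a total order, here is a minimal Python sketch. It is not a description of Hive's internals; the function names `local_top_n` and `global_top_n` are made up for illustration. Each partition keeps only its own N largest values, so the final merge step sees at most N values per partition, however large the input is.

```python
import heapq


def local_top_n(partition, n):
    """Keep only the n largest values of one partition (the 'map side')."""
    return heapq.nlargest(n, partition)


def global_top_n(partitions, n):
    """Merge the per-partition candidates in a single final step."""
    candidates = []
    for part in partitions:
        # Each partition contributes at most n values to the final merge.
        candidates.extend(local_top_n(part, n))
    return heapq.nlargest(n, candidates)


if __name__ == "__main__":
    parts = [[5, 1, 9], [7, 3], [8, 2, 6]]
    print(global_top_n(parts, 3))
```

A total order has no such shortcut: every record must reach the output, so funnelling everything through one reducer becomes the bottleneck.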

In case you haven’t come across it before, I suggest looking into Hadoop’s TeraSort example, which is specifically aimed at sorting a large dataset as efficiently as possible using MapReduce: http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/examples/terasort/package-summary.html
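The idea behind that kind of sort is to sample the keys, derive split points from the sample, and send each key to the partition that covers its range. The partitions can then be sorted in parallel and concatenated. Below is a rough single-process sketch of that idea in Python. It is not TeraSort's actual code, and all names (`choose_split_points`, `partition_index`, `total_order_sort`) and default parameters are assumptions chosen for illustration.

```python
import bisect
import random


def choose_split_points(sample, num_partitions):
    """Pick num_partitions - 1 boundary keys from a sample of the input."""
    ordered = sorted(sample)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def partition_index(key, split_points):
    """Return the index of the partition whose key range contains `key`."""
    return bisect.bisect_right(split_points, key)


def total_order_sort(records, num_partitions=4, sample_size=100, seed=0):
    """Range-partition the records, sort each partition, concatenate."""
    records = list(records)
    if not records:
        return []

    # Sample the input to estimate the key distribution.
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    split_points = choose_split_points(sample, num_partitions)

    # "Shuffle": every key goes to the partition covering its range.
    partitions = [[] for _ in range(num_partitions)]
    for key in records:
        partitions[partition_index(key, split_points)].append(key)

    # Each partition could be sorted by a separate reducer; since the
    # ranges do not overlap, concatenating them yields a total order.
    result = []
    for part in partitions:
        result.extend(sorted(part))
    return result


if __name__ == "__main__":
    data = [random.randint(0, 1000) for _ in range(20)]
    print(total_order_sort(data))
```

The quality of the sample only affects how evenly the work is balanced across partitions, not the correctness of the final order.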
