I want to sort a big dataset efficiently (i.e. with a custom partitioner, as described here: How does the MapReduce sort algorithm work?), but I want to do it with Hive.
However, the Hive manual states that “order by” is performed by a single reducer.
This surprises me, as Pig does implement something similar to the article: pig impl
Am I missing something, or is it that Hive simply isn't the right hammer for this job?
I think Hive is not the right tool for this job, at least for now. It is built to be used as an OLAP/reporting tool and is therefore not optimized to produce large result datasets, since most analytical queries produce relatively small result sets. As a result, it has good TOP N capability but no good total-order sort.
In case you haven't come across it before, I suggest looking into Hadoop's terasort example, which is specifically aimed at sorting a large dataset as efficiently as possible using MapReduce. http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/examples/terasort/package-summary.html
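
To make the idea behind a total-order sort with a custom partitioner concrete, here is a minimal single-process sketch in Python. It is not the terasort code and does not use any Hadoop or Hive API; the function names (`choose_split_points`, `partition_for`, `total_order_sort`) and the sample size are made up for illustration. The assumption is the usual one: sample the keys, pick split points from the sample, send each key to the "reducer" whose range contains it, and let each reducer sort only its own range, so that the reducer outputs are already in global order when read one after another.

```python
import bisect
import random


def choose_split_points(sample, num_partitions):
    """Pick num_partitions - 1 boundary keys from a sample of the input."""
    ordered = sorted(sample)
    if not ordered:
        return []
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def partition_for(key, split_points):
    """Range partitioner: index of the partition whose key range holds key."""
    return bisect.bisect_right(split_points, key)


def total_order_sort(records, num_partitions, sample_size=100, seed=0):
    """Simulate a sampled range-partitioned sort in a single process."""
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    split_points = choose_split_points(sample, num_partitions)

    # "Shuffle" phase: route every key to exactly one partition.
    partitions = [[] for _ in range(num_partitions)]
    for key in records:
        partitions[partition_for(key, split_points)].append(key)

    # "Reduce" phase: each partition is sorted independently.
    return [sorted(part) for part in partitions]


if __name__ == "__main__":
    data = list(range(1000))
    random.Random(42).shuffle(data)
    parts = total_order_sort(data, num_partitions=4)
    merged = [key for part in parts for key in part]
    print(merged == sorted(data))
```

Because every key in partition i is smaller than every key in partition i + 1, no final merge step is needed. How evenly the work is spread depends on how well the sample reflects the real key distribution.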