I have some experience with MySQL, and recently I have had to do some work in Hive instead.
The basic structure of the queries is quite similar between the two, but GROUP BY in Hive seems to work a bit differently, so I can't get what I used to get in MySQL with GROUP BY.
Here is my question. Say I have a table with columns A, B, and C, and I want the maximum value of B for each value of A. I would do:
SELECT A, max(B) FROM myTable GROUP BY A
This works in Hive with no problem. But what if I also want the value of column C from the same row that holds the maximum B? In MySQL I can just do:
SELECT A, max(B), C FROM myTable GROUP BY A
But I can't do this in Hive. It complains that C is not in the GROUP BY keys, and if I add C to the GROUP BY, the result is not what I want at all: I get one row per (A, C) combination instead of one row per A.
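To make the problem concrete, here is a small illustration. The sample rows are made up for this example and are not from a real table.

```sql
-- Hypothetical contents of myTable (A, B, C), assumed for illustration:
--   1, 10, 'x'
--   1, 20, 'y'
--   2,  5, 'z'

-- Adding C to the grouping keys makes the query valid in Hive...
SELECT A, max(B), C FROM myTable GROUP BY A, C;

-- ...but it returns one row per distinct (A, C) pair
-- (row order is not guaranteed):
--   1, 10, 'x'
--   1, 20, 'y'
--   2,  5, 'z'
-- whereas the goal is one row per A, carrying the C of the max-B row:
--   1, 20, 'y'
--   2,  5, 'z'
```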
So how do I get this result in Hive? Some say that using collect_set on column C solves the problem, but I have no idea how the output of collect_set is ordered, so I don't know which element to pick.
Okay, I figured this out. The following does the trick:
It turns out I have to write a lot more code in Hive to get the same result I would get with just one line in MySQL. 🙁
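As a sketch of how this kind of query can be written in Hive (not necessarily the exact query meant above), two common approaches are a self-join against the per-group maximum and a window function. The aliases t, m, ranked, max_b, and rn are names chosen for this example, and the second query assumes a Hive version with windowing support (0.11 or later).

```sql
-- Sketch 1: self-join against the per-group maximum.
-- Returns every row that ties for the maximum B within its A group.
SELECT t.A, t.B, t.C
FROM myTable t
JOIN (
    SELECT A, max(B) AS max_b
    FROM myTable
    GROUP BY A
) m
  ON t.A = m.A AND t.B = m.max_b;

-- Sketch 2: window function (assumes Hive 0.11 or later).
-- Returns exactly one row per A; ties on B are broken arbitrarily.
SELECT A, B, C
FROM (
    SELECT A, B, C,
           row_number() OVER (PARTITION BY A ORDER BY B DESC) AS rn
    FROM myTable
) ranked
WHERE rn = 1;
```

The two differ on ties: the join keeps all rows that share the maximum B, while row_number() keeps only one of them. Note also that the one-line MySQL version does not actually promise that C comes from the row with the maximum B; MySQL picks C from an arbitrary row in the group, and rejects the query outright when ONLY_FULL_GROUP_BY is enabled.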