Hive has a pretty nice Array type that is very useful in theory, but in practice I found very little information on how to perform any kind of operations on it.
We store a series of numbers in an array-type column and need to SUM them in a query, preferably from the n-th to the m-th element. Is this possible with standard HiveQL, or does it require a UDF or a custom mapper/reducer?
Note: we’re using Hive 0.8.1 in an EMR environment.
I’d write a simple UDF for this purpose. You need to have hive-exec in your build path. E.g. in the case of Maven:
A simple raw implementation would look like this:
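As a rough sketch of the summing logic such a UDF could carry, here is a plain-Java version with no Hive dependency so it can be run on its own. The class name `ArraySum`, the method name `sum`, the 0-based inclusive meaning of m and n, and the handling of nulls and out-of-range bounds are all assumptions made for illustration. In an actual UDF the class would extend `org.apache.hadoop.hive.ql.exec.UDF` from hive-exec and the two methods would be instance methods named `evaluate`.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of the summing logic (hypothetical names).
// In a real Hive UDF this class would extend
// org.apache.hadoop.hive.ql.exec.UDF and expose these as evaluate(...).
class ArraySum {

    // Sum of every element; a null array yields null.
    static Long sum(List<Integer> values) {
        if (values == null) {
            return null;
        }
        return sum(values, 0, values.size() - 1);
    }

    // Sum of elements m..n. Assumption: indices are 0-based and inclusive,
    // and bounds outside the array are clamped instead of raising an error.
    static Long sum(List<Integer> values, int m, int n) {
        if (values == null) {
            return null;
        }
        int from = Math.max(m, 0);
        int to = Math.min(n, values.size() - 1);
        long total = 0L;
        for (int i = from; i <= to; i++) {
            Integer v = values.get(i);
            if (v != null) {          // null elements are skipped
                total += v;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> row = Arrays.asList(10, 20, 30, 40, 50);
        System.out.println(sum(row));        // prints 150
        System.out.println(sum(row, 1, 3));  // prints 90 (20 + 30 + 40)
    }
}
```

Accumulating into a `long` avoids overflow when many `int` values are added; whether the range should be 0-based or 1-based is a design choice you would want to settle to match how your queries index arrays.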
Next, build a jar and load it in the Hive shell:
Now you can use it to calculate the sum of the array you have.
E.g.:
Let’s assume that you have an input file with tab-separated columns:
Load it into mytable:
Then execute some queries:
Sum it over the range m to n, where m=1 and n=3:
Or