I have an external table with a single column, data, where each value is a JSON object.
When I run the following Hive query:
hive> select get_json_object(data, "$.ev") from data_table limit 3;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201212171824_0218, Tracking URL = http://master:50030/jobdetails.jsp?jobid=job_201212171824_0218
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=master:8021 -kill job_201212171824_0218
2013-01-24 10:41:37,271 Stage-1 map = 0%, reduce = 0%
....
2013-01-24 10:41:55,549 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201212171824_0218
OK
2
2
2
Time taken: 21.449 seconds
But when I run the sum aggregation, the result is strange:
hive> select sum(get_json_object(data, "$.ev")) from data_table limit 3;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201212171824_0217, Tracking URL = http://master:50030/jobdetails.jsp?jobid=job_201212171824_0217
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=master:8021 -kill job_201212171824_0217
2013-01-24 10:39:24,485 Stage-1 map = 0%, reduce = 0%
.....
2013-01-24 10:41:00,760 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201212171824_0217
OK
9.4031522E7
Time taken: 100.416 seconds
Could anyone explain why that is, and what I should do to make it work properly?
Hive seems to be treating the values in your JSON as floats instead of ints. get_json_object returns a string, and summing strings makes Hive convert them to floating-point numbers. It also looks like your table is pretty big, so Hive is printing the result in exponent notation, which it uses for large floats: 9.4031522E7 means 9.4031522 × 10^7, that is, 94031522.
If you want to make sure you're doing a sum over ints, you can cast the field of your JSON to int, and the sum should then return an integer.
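
Something along these lines should work. It is only a sketch that reuses the table and column names from the question, and I have not run it against your data:

```sql
-- Sketch, assuming the table and column names from the question.
-- get_json_object returns a string, so cast it to int before aggregating.
select sum(cast(get_json_object(data, '$.ev') as int)) from data_table;
```

If individual values might not fit in an int, casting to bigint instead is the safer choice. The limit 3 from the original query is left out because an aggregate without a group by returns a single row anyway.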