In the GenericUDAFCount.java:
@Description(name = "count",
value = "_FUNC_(*) - Returns the total number of retrieved rows, including "
+ "rows containing NULL values.\n"
+ "_FUNC_(expr) - Returns the number of rows for which the supplied "
+ "expression is non-NULL.\n"
+ "_FUNC_(DISTINCT expr[, expr...]) - Returns the number of rows for "
+ "which the supplied expression(s) are unique and non-NULL.")
but I don`t see any code to deal with the ‘distinct’ expression.
public static class GenericUDAFCountEvaluator extends GenericUDAFEvaluator {
private boolean countAllColumns = false;
private LongObjectInspector partialCountAggOI;
private LongWritable result;
@Override
public ObjectInspector init(Mode m, ObjectInspector[] parameters)
throws HiveException {
super.init(m, parameters);
partialCountAggOI =
PrimitiveObjectInspectorFactory.writableLongObjectInspector;
result = new LongWritable(0);
return PrimitiveObjectInspectorFactory.writableLongObjectInspector;
}
private GenericUDAFCountEvaluator setCountAllColumns(boolean countAllCols) {
countAllColumns = countAllCols;
return this;
}
/** class for storing count value. */
static class CountAgg implements AggregationBuffer {
long value;
}
@Override
public AggregationBuffer getNewAggregationBuffer() throws HiveException {
CountAgg buffer = new CountAgg();
reset(buffer);
return buffer;
}
@Override
public void reset(AggregationBuffer agg) throws HiveException {
((CountAgg) agg).value = 0;
}
@Override
public void iterate(AggregationBuffer agg, Object[] parameters)
throws HiveException {
// parameters == null means the input table/split is empty
if (parameters == null) {
return;
}
if (countAllColumns) {
assert parameters.length == 0;
((CountAgg) agg).value++;
} else {
assert parameters.length > 0;
boolean countThisRow = true;
for (Object nextParam : parameters) {
if (nextParam == null) {
countThisRow = false;
break;
}
}
if (countThisRow) {
((CountAgg) agg).value++;
}
}
}
@Override
public void merge(AggregationBuffer agg, Object partial)
throws HiveException {
if (partial != null) {
long p = partialCountAggOI.get(partial);
((CountAgg) agg).value += p;
}
}
@Override
public Object terminate(AggregationBuffer agg) throws HiveException {
result.set(((CountAgg) agg).value);
return result;
}
@Override
public Object terminatePartial(AggregationBuffer agg) throws HiveException {
return terminate(agg);
}
}
How does hive achieve count(distinct ...)? When task runs, it really cost much time.
Where is it in the source code?
As you can just run SELECT DISTINCT column1 FROM table1, DISTINCT expression isn’t a flag or option, it’s evaluated independently
This page says:
If you want drill down to source details, have a look into hive git repository