In the GenericUDAFCount.java: @Description(name = count, value = _FUNC_(*) – Returns the total number

Question

Asked: June 10, 2026

In GenericUDAFCount.java, the function is described as:

@Description(name = "count",
    value = "_FUNC_(*) - Returns the total number of retrieved rows, including "
          + "rows containing NULL values.\n"
          + "_FUNC_(expr) - Returns the number of rows for which the supplied "
          + "expression is non-NULL.\n"
          + "_FUNC_(DISTINCT expr[, expr...]) - Returns the number of rows for "
          + "which the supplied expression(s) are unique and non-NULL.")

But I don't see any code in the evaluator that handles the DISTINCT expression:

public static class GenericUDAFCountEvaluator extends GenericUDAFEvaluator {
private boolean countAllColumns = false;
private LongObjectInspector partialCountAggOI;
private LongWritable result;

@Override
public ObjectInspector init(Mode m, ObjectInspector[] parameters)
throws HiveException {
  super.init(m, parameters);
  partialCountAggOI =
    PrimitiveObjectInspectorFactory.writableLongObjectInspector;
  result = new LongWritable(0);
  return PrimitiveObjectInspectorFactory.writableLongObjectInspector;
}

private GenericUDAFCountEvaluator setCountAllColumns(boolean countAllCols) {
  countAllColumns = countAllCols;
  return this;
}

/** class for storing count value. */
static class CountAgg implements AggregationBuffer {
  long value;
}

@Override
public AggregationBuffer getNewAggregationBuffer() throws HiveException {
  CountAgg buffer = new CountAgg();
  reset(buffer);
  return buffer;
}

@Override
public void reset(AggregationBuffer agg) throws HiveException {
  ((CountAgg) agg).value = 0;
}

@Override
public void iterate(AggregationBuffer agg, Object[] parameters)
  throws HiveException {
  // parameters == null means the input table/split is empty
  if (parameters == null) {
    return;
  }
  if (countAllColumns) {
    assert parameters.length == 0;
    ((CountAgg) agg).value++;
  } else {
    assert parameters.length > 0;
    boolean countThisRow = true;
    for (Object nextParam : parameters) {
      if (nextParam == null) {
        countThisRow = false;
        break;
      }
    }
    if (countThisRow) {
      ((CountAgg) agg).value++;
    }
  }
}

@Override
public void merge(AggregationBuffer agg, Object partial)
  throws HiveException {
  if (partial != null) {
    long p = partialCountAggOI.get(partial);
    ((CountAgg) agg).value += p;
  }
}

@Override
public Object terminate(AggregationBuffer agg) throws HiveException {
  result.set(((CountAgg) agg).value);
  return result;
}

@Override
public Object terminatePartial(AggregationBuffer agg) throws HiveException {
  return terminate(agg);
}

}

How does Hive implement count(distinct ...)? When the task runs, it takes a long time.
Where is this handled in the source code?

1 Answer

Editorial Team · Answer 1 · 2026-06-10T17:28:45+00:00

Just as you can run SELECT DISTINCT column1 FROM table1 on its own, DISTINCT is not a flag or option of the COUNT function; it is evaluated independently of it.

This page says:

The actual filtering of data bound to parameter types for DISTINCT
implementation is handled by the framework and not the COUNT UDAF
implementation.
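
To make that division of labour concrete, here is a minimal sketch in plain Java. It is not Hive's actual code: the class DistinctCountSketch, the simplified CountEvaluator, and the countDistinct method are all hypothetical names made up for illustration. The idea it shows is that the caller (standing in for the framework) remembers which parameter tuples it has already seen and forwards only new ones to iterate, so the evaluator itself needs no DISTINCT logic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DistinctCountSketch {

  /** Simplified stand-in for the COUNT evaluator (hypothetical, not the Hive class). */
  static class CountEvaluator {
    long value = 0;

    /** Counts the row only if every parameter is non-null, like count(expr). */
    void iterate(Object[] parameters) {
      for (Object p : parameters) {
        if (p == null) {
          return; // a NULL argument means the row is not counted
        }
      }
      value++;
    }
  }

  /** Hypothetical "framework" step: forwards each distinct parameter tuple once. */
  static long countDistinct(List<Object[]> rows) {
    CountEvaluator evaluator = new CountEvaluator();
    Set<List<Object>> seen = new HashSet<>();
    for (Object[] row : rows) {
      // Arrays.asList gives value-based equals/hashCode, so equal tuples collapse.
      if (seen.add(Arrays.asList(row))) {
        evaluator.iterate(row);
      }
    }
    return evaluator.value;
  }

  public static void main(String[] args) {
    List<Object[]> rows = new ArrayList<>();
    rows.add(new Object[] {"a"});
    rows.add(new Object[] {"b"});
    rows.add(new Object[] {"a"});  // duplicate, filtered before iterate
    rows.add(new Object[] {null}); // reaches iterate once, but is not counted
    System.out.println(countDistinct(rows)); // prints 2
  }
}
```

Note that the filtering step has to remember every distinct tuple it has seen (or rely on sorted input instead), which is one plausible reason a count(distinct ...) query costs more than a plain count.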

If you want to drill down into the source details, have a look at the Hive git repository.
