The HyperLogLog algorithm by Flajolet et al describes a clever way to estimate the

Question

0

Asked: June 15, 20262026-06-15T13:42:16+00:00 2026-06-15T13:42:16+00:00

The HyperLogLog algorithm by Flajolet et al describes a clever way to estimate the

0

The HyperLogLog
algorithm by Flajolet et al describes a clever way to estimate the cardinality
of a set using only a tiny amount of memory. However, it does take into
account all N elements of the original set in the calculation. What if
we had access to only a small random sample (say, 10%) of the original N?
Has there been any research on how HyperLogLog or similar algorithms can be
adapted to this situation?

I am aware that this is essentially the problem described as distinct
value estimation, for which abundant research exists (see for example this
paper for an overview). However,
the research on the distinct value estimation that I’m aware of uses a number
of ad-hoc estimators very different from the approach used by HyperLogLog.
Therefore, I am wondering if someone has already thought of adapting
HyperLogLog to the distinct value estimation problem.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-15T13:42:18+00:00

However, the research on the distinct value estimation that I’m aware
of uses a number of ad-hoc estimators very different from the approach
used by HyperLogLog.

Yes, because they are solving a very different problem.

Suppose you just confiscated a stash of 1.000.000 counterfeit dollar bills, and you want to know the number of distinct serial numbers.

Sampling 100.000 of them (using HyperLogLog, as your antique steam-driven counting machine has only 1k memory) you count 5000 different serial numbers, each of which occurs somewhere around 20 times. Then you can be pretty sure that the whole stash will contain only a little over 5000 distinct serial numbers.

Now suppose that 1 serial number occurs 95.001 times, and 4999 serial numbers occur only once. Apparently some bona fide bank notes found their way into your stash. Now you can be pretty confident that the stash contains around 5% honest banknotes, so that the entire stash contains around 50.000 distinct serial numbers

Note that the distribution of the frequencies in your sample is used to infer something about the distribution in the entire stash. This is actually mentioned as one of the “ad hoc” (your words) methods in the second paper you cite (“Sampling-based estimation of the number of distinct values(..)”):

The idea behind a parametric estimator is to fit a probability distribution to the
observed relative frequencies of the different attribute values.

Also note that the results of HyperLogLog and similar methods are completely insensitive to the distribution of the samples over their values. But your final estimate evidently depends very much on it!

My advice: use method of your choice (like HyperLogLog) to count the number of distinct values in your sample, and then use one of the methods in “Sampling-based estimation” to estimate the number of values in your entire multiset , or use your prior knowledge abut the distribution of the multiset to calculate an estimate (maybe you saw the counterfeiters’ printing press, and you know it could only ever print one serial number)

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

The HyperLogLog algorithm by Flajolet et al describes a clever way to estimate the

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply