The Archive Base

Editorial Team
Asked: June 15, 2026

The HyperLogLog algorithm by Flajolet et al describes a clever way to estimate the

The HyperLogLog algorithm by Flajolet et al. describes a clever way to
estimate the cardinality of a set using only a tiny amount of memory.
However, it does take into account all N elements of the original set in
the calculation. What if we had access to only a small random sample (say,
10%) of the original N? Has there been any research on how HyperLogLog or
similar algorithms can be adapted to this situation?

I am aware that this is essentially the problem described as distinct value
estimation, for which abundant research exists (see for example this paper
for an overview). However, the research on the distinct value estimation
that I’m aware of uses a number of ad-hoc estimators very different from the
approach used by HyperLogLog. Therefore, I am wondering if someone has
already thought of adapting HyperLogLog to the distinct value estimation
problem.


1 Answer

  1. Editorial Team
    Added an answer on June 15, 2026 at 1:42 pm

    However, the research on the distinct value estimation that I’m aware
    of uses a number of ad-hoc estimators very different from the approach
    used by HyperLogLog.

    Yes, because they are solving a very different problem.

    Suppose you just confiscated a stash of 1,000,000 counterfeit dollar bills, and you want to know the number of distinct serial numbers.

    Sampling 100,000 of them (using HyperLogLog, as your antique steam-driven counting machine has only 1k memory), you count 5000 different serial numbers, each of which occurs somewhere around 20 times. Then you can be pretty sure that the whole stash will contain only a little over 5000 distinct serial numbers.

    Now suppose instead that 1 serial number occurs 95,001 times, and 4999 serial numbers occur only once. Apparently some bona fide banknotes found their way into your stash. Now you can be pretty confident that the stash contains around 5% honest banknotes, so that the entire stash contains around 50,000 distinct serial numbers.
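
The arithmetic behind the two scenarios can be sketched in a few lines of Python. This is only an illustration of the reasoning above, not an estimator taken from the cited literature: `naive_scale_up` is a hypothetical helper that assumes values seen more than once in the sample are already fully discovered, while values seen exactly once stand in for N/n values of the same kind in the full stash. The serial-number labels are made up, and the first scenario is simplified to exactly 20 occurrences per serial number.

```python
from collections import Counter

def naive_scale_up(sample_counts, population_size):
    """Hypothetical heuristic: count repeated values once, scale singletons by N/n."""
    n = sum(sample_counts.values())                      # sample size
    singletons = sum(1 for c in sample_counts.values() if c == 1)
    repeated = len(sample_counts) - singletons           # values seen 2+ times
    return repeated + singletons * population_size / n

N = 1_000_000  # size of the whole stash

# Scenario 1: 5000 serial numbers, each seen exactly 20 times in the sample.
uniform = Counter({f"fake-{i}": 20 for i in range(5000)})

# Scenario 2: one serial number seen 95,001 times, 4999 others seen once.
skewed = Counter({"fake-0": 95_001})
skewed.update({f"real-{i}": 1 for i in range(4999)})

for name, counts in (("uniform", uniform), ("skewed", skewed)):
    # Both samples contain 5000 distinct values, yet the estimates differ.
    print(name, len(counts), round(naive_scale_up(counts, N)))
```

Both samples have the same distinct count (5000), but the heuristic returns 5000 for the first and 49,991 for the second, which is the "around 50,000" of the example.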

    Note that the distribution of the frequencies in your sample is used to infer something about the distribution in the entire stash. This is actually mentioned as one of the “ad hoc” (your words) methods in the second paper you cite (“Sampling-based estimation of the number of distinct values(..)”):

    The idea behind a parametric estimator is to fit a probability distribution to the
    observed relative frequencies of the different attribute values.

    Also note that the results of HyperLogLog and similar methods are completely insensitive to the distribution of the samples over their values. But your final estimate evidently depends very much on it!
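
To make the insensitivity concrete, here is a minimal HyperLogLog sketch in Python. It is a simplified illustration under stated assumptions, not a production implementation: the class name `TinyHLL`, the use of SHA-1 as the hash, and the default of 2^10 registers are my own choices, and the bias and large-range corrections of the full algorithm are left out. Because each register only keeps a maximum, feeding the same value again never changes the state.

```python
import hashlib
import math

class TinyHLL:
    """Minimal HyperLogLog sketch (no bias or large-range correction)."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash value
        idx = h >> (64 - self.p)                       # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64-p bits
        rank = (64 - self.p) - rest.bit_length() + 1   # position of leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)          # constant valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)   # linear counting for small sets
        return raw

# Same 5000 distinct values, very different frequency distributions.
uniform, skewed = TinyHLL(), TinyHLL()
for value in range(5000):
    for _ in range(20):
        uniform.add(value)                             # every value 20 times
for _ in range(95_001):
    skewed.add(0)                                      # one value 95,001 times
for value in range(1, 5000):
    skewed.add(value)                                  # the rest exactly once

print(uniform.estimate(), skewed.estimate())
print(uniform.registers == skewed.registers)
```

The two sketches end up with identical registers and therefore identical estimates, so nothing about the frequency distribution survives in the HyperLogLog state.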

    My advice: use a method of your choice (like HyperLogLog) to count the number of distinct values in your sample, and then use one of the methods in “Sampling-based estimation” to estimate the number of distinct values in your entire multiset, or use your prior knowledge about the distribution of the multiset to calculate an estimate (maybe you saw the counterfeiters’ printing press, and you know it could only ever print one serial number).

© 2021 The Archive Base. All Rights Reserved