We get these ~50GB data files consisting of 16 byte codes, and I want

Question

Asked: May 10, 2026 (2026-05-10T22:16:21+00:00)

We get these ~50GB data files consisting of 16-byte codes, and I want to find any code that occurs 0.5% of the time (1 in 200) or more. Is there any way I can do that in a single pass over the data?

Edit: There are tons of codes – it’s possible that every code is different.

EPILOGUE: I’ve selected Darius Bacon’s answer as the best, because I think the best algorithm is a modification of the majority element algorithm he linked to. The majority algorithm can be modified to use only a tiny amount of memory: about 201 codes to get 0.5%, I think. Basically you walk the stream keeping counters for up to 201 distinct codes. As soon as you have 201 distinct codes, you drop one of each (deduct 1 from every counter, forgetting any that reaches 0). Each drop removes 201 items, so by the end you have dropped at most N/201 times, where N is the total number of codes, and any code occurring more often than that must still be around. Since N/201 is less than N/200, that covers every code at 0.5% or more.
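A minimal sketch of the candidate pass described above, in Python. The names `read_codes` and `frequent_candidates`, the chunk size, and the parameter `k` are my own illustrative choices, not from the linked article. It holds at most k - 1 counters and treats the arrival of a k-th distinct code as the "drop one of each" step, which also discards the new arrival.

```python
def read_codes(path, code_size=16, chunk_records=65536):
    """Yield fixed-size codes from a binary file (hypothetical helper).

    Assumes the file is a flat sequence of code_size-byte records;
    a trailing partial record, if any, is ignored.
    """
    with open(path, "rb") as f:
        while True:
            chunk = f.read(code_size * chunk_records)
            if not chunk:
                break
            for i in range(0, len(chunk) - code_size + 1, code_size):
                yield chunk[i:i + code_size]


def frequent_candidates(codes, k=201):
    """One pass over `codes`, keeping at most k - 1 counters.

    Every code that occurs more than N/k times (N = stream length)
    is guaranteed to be a key of the returned dict. The dict may also
    contain infrequent codes, and its values are not exact counts.
    """
    counters = {}
    for code in codes:
        if code in counters:
            counters[code] += 1
        elif len(counters) < k - 1:
            counters[code] = 1
        else:
            # `code` is the k-th distinct code in play: drop one of each.
            # The new arrival goes from 1 to 0, so it is never stored.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

With k=201 this would be called as `frequent_candidates(read_codes("data.bin"))`, where `data.bin` is a placeholder path. The result is only a candidate set, so the counts still need checking.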

But it’s a two-pass algorithm, not one. You need a second pass to tally the counts of the candidates. It’s actually easy to see that any exact solution with this little memory must use at least 2 passes (the first batch of elements you load could all be different, and one of those codes could end up being exactly 0.5%).
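The second pass could look something like the sketch below, assuming the first pass left you with a small collection of candidate codes. The function name `confirm_frequent` and the `denominator` parameter are hypothetical. The threshold is checked with integer arithmetic, so "exactly 0.5%" is not lost to floating-point rounding.

```python
def confirm_frequent(codes, candidates, denominator=200):
    """Second pass: count the candidates exactly and keep the frequent ones.

    Returns {code: count} for every candidate whose count is at least
    total / denominator, i.e. 0.5% of the stream when denominator is 200.
    """
    tallies = dict.fromkeys(candidates, 0)
    total = 0
    for code in codes:
        total += 1
        if code in tallies:
            tallies[code] += 1
    # count / total >= 1 / denominator, rewritten without division
    return {
        code: count
        for code, count in tallies.items()
        if count > 0 and count * denominator >= total
    }
```

Memory stays proportional to the number of candidates, so rereading the 50GB file is the only real cost.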

Thanks for the help!


1 Answer

score 0 · Answer 1 · 2026-05-10T22:16:22+00:00

Metwally et al., Efficient Computation of Frequent and Top-k Elements in Data Streams (2005). There were some other relevant papers I read for my work at Yahoo that I can’t find now, but this looks like a good start.

Edit: Ah, see this Brian Hayes article. It sketches an exact algorithm due to Demaine et al., with references. It does it in one pass with very little memory, yielding a set of items including the frequent ones you’re looking for, if they exist. Getting the exact counts takes a (now-tractable) second pass.
