We get these ~50GB data files consisting of 16 byte codes, and I want to find any code that occurs 1/2% of the time or more. Is there any way I can do that in a single pass over the data?
Edit: There are tons of codes – it’s possible that every code is different.
EPILOGUE: I’ve selected Darius Bacon as best answer, because I think the best algorithm is a modification of the majority element he linked to. The majority algorithm should be modifiable to only use a tiny amount of memory – like 201 codes to get 1/2% I think. Basically you just walk the stream counting up to 201 distinct codes. As soon as you find 201 distinct codes, you drop one of each code (deduct 1 from the counters, forgetting anything that becomes 0). At the end, you have dropped at most N/201 times, so any code occurring more times than that must still be around.
But it’s a two pass algorithm, not one. You need a second pass to tally the counts of the candidates. It’s actually easy to see that any solution to this problem must use at least 2 passes (the first batch of elements you load could all be different and one of those codes could end up being exactly 1/2%)
Thanks for the help!
Metwally et al., Efficient Computation of Frequent and Top-k Elements in Data Streams (2005). There were some other relevant papers I read for my work at Yahoo that I can’t find now; but this looks like a good start.
Edit: Ah, see this Brian Hayes article. It sketches an exact algorithm due to Demaine et al., with references. It does it in one pass with very little memory, yielding a set of items including the frequent ones you’re looking for, if they exist. Getting the exact counts takes a (now-tractable) second pass.