I’m looking for a single-pass algorithm for finding the top X percent of floats in a stream where I do not know the total number ahead of time, but it’s on the order of 5-30 million floats. It needs to be single-pass since the data is generated on the fly and I cannot recreate the exact stream a second time.
The algorithm I have so far is to keep a sorted list of the topX items that I’ve seen so far. As the stream continues I enlarge the list as needed. Then I use bisect_left to find the insertion point if needed.
Here is the code:
from bisect import bisect_left
from random import uniform
from itertools import islice

def data_gen(num):
    for _ in range(num):
        yield uniform(0, 1)

def get_top_X_percent(iterable, percent=0.01, min_guess=1000):
    iterable = iter(iterable)  # one shared iterator, so a list is not re-read from the start
    top_nums = sorted(islice(iterable, int(percent * min_guess)))  # get an initial guess
    # ind counts the items seen so far, including the current one
    for ind, val in enumerate(iterable, len(top_nums) + 1):
        if int(percent * ind) > len(top_nums):
            # grow the list by one slot; -inf sorts below every real value
            top_nums.insert(0, float('-inf'))
        newind = bisect_left(top_nums, val)
        if newind > 0:
            top_nums.insert(newind, val)
            top_nums.pop(0)
    return top_nums

if __name__ == '__main__':
    num = 1000000
    all_data = sorted(data_gen(num))
    result = get_top_X_percent(all_data)
    assert result[0] == all_data[-int(num * 0.01)], 'Too far off, lowest num:%f' % result[0]
    print(result[0])
In the real case the data does not come from any standard distribution (otherwise I could use some statistics knowledge).
Any suggestions would be appreciated.
I’m not sure there’s any way to actually do that reliably, as the range denoted by the “top X percent” can grow unpredictably as you see more elements. Consider the following input:
If you wanted the top 25% of elements, you’d end up picking 101 and 102 out of the first ten elements, but after seeing enough zeroes after that you’d eventually have to select all of the first ten. This same pattern can be extended to any sufficiently large stream: it’s always possible to be misled by appearances and discard elements that you actually should have kept. As such, unless you know the exact length of the stream ahead of time, I don’t think this is possible (short of keeping every element in memory until you hit the end of the stream).
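To make the failure mode concrete, here is a minimal sketch. The stream below is a hypothetical one built to match the description (ten values that include 101 and 102, followed by zeroes); it is not necessarily the exact input meant above. The names `streaming_top_fraction` and `exact_top_fraction` are made up for this illustration, and the one-pass tracker is a simple heap-based stand-in for "keep only the current top X percent", not the code from the question.

```python
import heapq

def streaming_top_fraction(stream, fraction):
    """One-pass tracker (illustrative): retain the largest
    int(fraction * seen) values; dropped values are gone for good."""
    kept = []  # min-heap of the values currently retained
    for seen, value in enumerate(stream, 1):
        target = int(fraction * seen)
        if len(kept) < target:
            heapq.heappush(kept, value)       # quota grew: current value gets in
        elif kept and value > kept[0]:
            heapq.heapreplace(kept, value)    # evict the smallest retained value
    return sorted(kept)

def exact_top_fraction(values, fraction):
    """Reference answer: buffer everything, then sort (needs the whole stream)."""
    values = sorted(values)
    k = int(fraction * len(values))
    return values[len(values) - k:]

# Hypothetical stream: ten small-to-large values, then thirty zeroes.
stream = [1, 2, 3, 4, 5, 6, 7, 8, 101, 102] + [0] * 30

print(streaming_top_fraction(stream[:10], 0.25))  # [101, 102]
print(streaming_top_fraction(stream, 0.25))       # [0, 0, 0, 0, 0, 0, 0, 0, 101, 102]
print(exact_top_fraction(stream, 0.25))           # [1, 2, 3, 4, 5, 6, 7, 8, 101, 102]
```

After the first ten values the tracker has already thrown away 1 through 8, so when the zeroes push the quota up to ten it can only fill the new slots with zeroes. The buffered version gets the right answer because it still has every element. For scale, 30 million floats stored as 8-byte doubles come to about 240 MB, so buffering may be practical if that much memory is available.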