I have a lot of random floating point numbers residing in global GPU memory. I also have “buckets” that specify ranges of numbers they will accept and a capacity of numbers they will accept.
For example:
numbers: -2 0 2 4
buckets(size=1): [-2, 0], [1, 5]
I want to run a filtration process that yields
filtered_nums: -2 2
(where filtered_nums can be a new block of memory)
But every approach I take runs into a huge overhead from synchronizing threads across the bucket counters. If I use a single thread, the algorithm completes successfully but takes frighteningly long (over 100 times slower than generating the numbers in the first place).
What I am asking for is a general, high-level, efficient, as-simple-as-possible algorithm that you would use to filter these numbers.
edit
I will be dealing with 10 buckets and half a million numbers, where every number falls into exactly 1 of the 10 bucket ranges. Each bucket will hold 43000 elements. (There are excess elements, since the objective is to fill every bucket, and many numbers will be discarded.)
2nd edit
It’s important to point out that the buckets do not have to be stored individually. The objective is just to discard elements that would not fit into a bucket.
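To pin down the expected result, here is a minimal sequential sketch of the filter described above. It is not from the original post: the names Bucket, find_bucket and filter_numbers are hypothetical, and closed ranges are assumed because the example writes [-2, 0]. It is written as three passes (bucket id per element, rank within the bucket, compaction) because on a GPU the middle pass could in principle be a per-bucket prefix sum rather than a shared counter.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bucket description: closed range [lo, hi] plus a capacity.
struct Bucket {
    float lo;
    float hi;
    std::size_t capacity;
};

// Returns the index of the first bucket whose range contains x, or -1 if none does.
int find_bucket(const std::vector<Bucket>& buckets, float x) {
    for (std::size_t b = 0; b < buckets.size(); ++b) {
        if (x >= buckets[b].lo && x <= buckets[b].hi) {
            return static_cast<int>(b);
        }
    }
    return -1;
}

// Keeps a number only if it falls in some bucket and that bucket is not yet full.
std::vector<float> filter_numbers(const std::vector<float>& nums,
                                  const std::vector<Bucket>& buckets) {
    assert(!buckets.empty());

    // Pass 1 (map): bucket id for every element.
    std::vector<int> id(nums.size());
    for (std::size_t i = 0; i < nums.size(); ++i) {
        id[i] = find_bucket(buckets, nums[i]);
    }

    // Pass 2 (rank): count how many earlier elements landed in the same bucket.
    // Sequentially this is a running counter; in parallel it would be an
    // exclusive prefix sum over a 0/1 flag per bucket.
    std::vector<std::size_t> seen(buckets.size(), 0);
    std::vector<char> keep(nums.size(), 0);
    for (std::size_t i = 0; i < nums.size(); ++i) {
        if (id[i] >= 0) {
            const std::size_t b = static_cast<std::size_t>(id[i]);
            keep[i] = (seen[b] < buckets[b].capacity) ? 1 : 0;
            ++seen[b];
        }
    }

    // Pass 3 (compaction): copy the survivors, preserving input order.
    std::vector<float> out;
    for (std::size_t i = 0; i < nums.size(); ++i) {
        if (keep[i]) {
            out.push_back(nums[i]);
        }
    }
    return out;
}
```

Calling filter_numbers on the numbers -2 0 2 4 with the two size-1 buckets from the example keeps -2 and 2, matching filtered_nums above.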
You can use thrust::remove_copy_if.
You will have to replace lo and hi with constants for each bin.
I think you can templatize the kernel, but then again you will have to instantiate the template with actual constants. I can’t see an easy way to do it, but I may be missing something.
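One possible way around hard-coded constants is to let the predicate carry lo and hi as ordinary members, so one functor type serves every bin. The sketch below is an assumption about what the suggestion above had in mind, not the answerer's code. It uses std::remove_copy_if, which takes the same (first, last, result, pred) arguments as thrust::remove_copy_if, so it runs on the host without a GPU. OutsideRange and fill_bin are hypothetical names, and the capacity cut-off is an extra step because the predicate alone only tests the range.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical predicate: true when x lies OUTSIDE the closed range [lo, hi],
// i.e. when the element should be removed. The bounds are runtime members,
// so no template instantiation per bin is needed.
// (For use with Thrust on the device, operator() would also need to be
// marked __host__ __device__.)
struct OutsideRange {
    float lo;
    float hi;
    bool operator()(float x) const { return x < lo || x > hi; }
};

// Copies the elements of nums that fall inside [lo, hi], in input order,
// then truncates the result to at most `capacity` elements.
std::vector<float> fill_bin(const std::vector<float>& nums,
                            float lo, float hi, std::size_t capacity) {
    assert(lo <= hi);
    std::vector<float> bin;
    std::remove_copy_if(nums.begin(), nums.end(),
                        std::back_inserter(bin), OutsideRange{lo, hi});
    if (bin.size() > capacity) {
        bin.resize(capacity);  // discard the excess elements
    }
    return bin;
}
```

With 10 bins this means 10 passes over the input, one fill_bin call per (lo, hi) pair, which may or may not be acceptable for half a million numbers.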
If you are willing to look at third-party libraries, ArrayFire may offer an easier solution.