Short version: how do I implement an efficient filter operation in CUDA?
Long version:
I have CUDA code that follows queue-filtering semantics.
I have ~5 million initial elements in the queue and the code filters them using an “expensive” stage-wise computation.
The end result is expected to retain ~1000 elements, and at each stage the number of removed elements follows an exponential decay curve (i.e. the first stages remove a lot, the last stages remove little).
Since on the GPU each element is processed in parallel (by blocks of threads), simply running “all stages over all elements” is quite wasteful. At a given stage, one element in a block may be retained while all the others have already been removed, but computation continues over all remaining stages even for elements already “ready to be removed”.
A more efficient approach would be to run each stage separately, reading an input list and storing the results in an intermediate output list, and then keep things running in a ping-pong scheme.
Doing this, however, generates significant global memory reads and writes, and more importantly puts pressure on the atomicInc that synchronizes the concurrent writes to the output list.
How would you suggest doing such stage-wise filtering?
Thanks for your answers and suggestions.
I suggest you use compact or remove_if. You can use the CUDPP library or thrust. You cannot avoid writing to global memory after each stage unless you compute all stages over all elements.
This is simple pseudocode:
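As a rough sketch of that idea with thrust (not necessarily what the original pseudocode looked like): each stage runs thrust::copy_if from an input buffer into an output buffer, and the two buffers are then swapped. StageKeep and filter_stages are hypothetical names, and the modulo test is only a placeholder for the real, expensive per-stage computation.

```cuda
#include <cstddef>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>

// Hypothetical stage test: a cheap stand-in for the expensive per-stage check.
// Assumption for illustration only: stage s keeps multiples of (s + 2).
struct StageKeep {
    int stage;
    __host__ __device__ bool operator()(int x) const {
        return x % (stage + 2) == 0;
    }
};

// Runs num_stages filtering stages in a ping-pong fashion.
// `out` must be at least as large as `in`. On return, the survivors
// occupy the first `count` slots of `in`, and `count` is returned.
int filter_stages(thrust::device_vector<int>& in,
                  thrust::device_vector<int>& out,
                  int num_stages) {
    std::size_t count = in.size();
    for (int s = 0; s < num_stages; ++s) {
        // Stream compaction: copy only the elements that pass stage s.
        auto end = thrust::copy_if(in.begin(), in.begin() + count,
                                   out.begin(), StageKeep{s});
        count = static_cast<std::size_t>(end - out.begin());
        // Ping-pong: the compacted output becomes the next stage's input.
        in.swap(out);
    }
    return static_cast<int>(count);
}
```

To try it, fill a device_vector with thrust::sequence, allocate a second vector of the same size, and call filter_stages(a, b, 3). Because copy_if computes output positions with a scan rather than a shared counter, the single contended atomicInc goes away, and each later stage only launches work for the elements that survived the previous one.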