I have a large array (say 512K elements), GPU resident, where only a small fraction of elements (say 5K randomly distributed elements – set S) needs to be processed. The algorithm to find out which elements belong to S is very efficient, so I can easily create an array A of pointers or indexes to elements from set S.
What is the most efficient way to run a CUDA or OpenCL kernel only over elements from S? Can I run a kernel over array A? All examples I’ve seen so far deal with contiguous 1D, 2D, or 3D arrays. Is there any problem with introducing one layer of indirection?
In CUDA, contiguous (not random) memory access is preferred because it allows memory coalescing. It is not hard to create an array of randomly distributed indexes and process one index from A per thread.
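A minimal sketch of the one-index-per-thread idea, under some assumptions: the element type is `float`, A holds distinct `int` indexes, and the "processing" is a placeholder multiply. The names `process_indexed` and `run_indexed` are made up for illustration, and the host helper copies data to the device only so the example is self-contained (in the question the array already lives on the GPU).

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// One thread per entry of the index list A; each thread touches data[A[i]].
__global__ void process_indexed(float* data, const int* A, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        int idx = A[i];        // reading A itself is contiguous across threads
        data[idx] *= 2.0f;     // scattered access into the large array (placeholder work)
    }
}

// Hypothetical host helper: upload, launch, download. Error checking omitted for brevity.
void run_indexed(std::vector<float>& data, const std::vector<int>& A)
{
    // Validate indexes on the host so the kernel never writes out of bounds.
    for (int idx : A) assert(idx >= 0 && idx < static_cast<int>(data.size()));

    int count = static_cast<int>(A.size());
    if (count == 0 || data.empty()) return;

    float* d_data = nullptr;
    int* d_A = nullptr;
    cudaMalloc(&d_data, data.size() * sizeof(float));
    cudaMalloc(&d_A, A.size() * sizeof(int));
    cudaMemcpy(d_data, data.data(), data.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_A, A.data(), A.size() * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 256;                               // assumed starting point, tune it
    int blocks = (count + threads - 1) / threads;    // enough blocks to cover A
    process_indexed<<<blocks, threads>>>(d_data, d_A, count);

    // This copy waits for the kernel to finish before reading the results back.
    cudaMemcpy(data.data(), d_data, data.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_A);
}
```

The grid is sized by the length of A, not by the size of the large array, so with about 5K indexes only a few dozen blocks are launched.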
But the scattered accesses to the elements of S through those random indexes will be much slower than coalesced access, and this will most likely be the bottleneck.
If you decide to use CUDA, you will need to experiment a lot with block and grid sizes, minimize register consumption per thread (to maximize the number of blocks per multiprocessor), and maybe sort A so that neighboring threads access nearby elements of S…
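A rough sketch of two of these tuning steps, assuming Thrust is available (it ships with the CUDA toolkit): sort A on the device so adjacent threads read nearby addresses, and ask the runtime for a starting block size with `cudaOccupancyMaxPotentialBlockSize`. The names `process_sorted` and `run_sorted`, the `float` element type, and the placeholder multiply are illustrative assumptions. Whether the sort pays for itself depends on how spread out the indexes are, so measure it.

```cuda
#include <cuda_runtime.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <cassert>
#include <vector>

// Same one-index-per-thread kernel; with A sorted, neighboring threads hit nearby addresses.
__global__ void process_sorted(float* data, const int* A, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        data[A[i]] *= 2.0f;    // placeholder work
    }
}

// Hypothetical host helper. Error checking omitted for brevity.
void run_sorted(std::vector<float>& data, const std::vector<int>& A)
{
    for (int idx : A) assert(idx >= 0 && idx < static_cast<int>(data.size()));

    int count = static_cast<int>(A.size());
    if (count == 0 || data.empty()) return;

    thrust::device_vector<float> d_data(data.begin(), data.end());
    thrust::device_vector<int> d_A(A.begin(), A.end());

    // Sort the index list on the GPU so thread i and thread i+1 access increasing addresses.
    thrust::sort(d_A.begin(), d_A.end());

    // Let the runtime suggest a block size that maximizes occupancy for this kernel.
    int min_grid = 0;
    int threads = 0;
    cudaOccupancyMaxPotentialBlockSize(&min_grid, &threads, process_sorted, 0, 0);
    if (threads <= 0) threads = 256;                 // fallback if the query fails

    int blocks = (count + threads - 1) / threads;
    process_sorted<<<blocks, threads>>>(thrust::raw_pointer_cast(d_data.data()),
                                        thrust::raw_pointer_cast(d_A.data()),
                                        count);
    cudaDeviceSynchronize();

    thrust::copy(d_data.begin(), d_data.end(), data.begin());
}
```

The suggested block size is only a starting point; the occupancy query does not know about the memory access pattern, so timing a few block sizes by hand is still worthwhile.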