I have a large array (say 512K elements), GPU resident, where only a small


Asked: May 16, 2026


I have a large array (say 512K elements), GPU resident, where only a small fraction of elements (say 5K randomly distributed elements – set S) needs to be processed. The algorithm to find out which elements belong to S is very efficient, so I can easily create an array A of pointers or indexes to elements from set S.

What is the most efficient way to run a CUDA or OpenCL kernel only over elements from S? Can I run a kernel over array A? All examples I’ve seen so far deal with contiguous 1D, 2D, or 3D arrays. Is there any problem with introducing one layer of indirection?

1 Answer

Editorial Team · Answer 1 · 2026-05-16T07:32:00+00:00

In CUDA, contiguous (rather than random) memory access is preferred because it lets the hardware coalesce memory transactions. Creating an array of randomly distributed indexes and processing one index from A per thread is straightforward, something like this:

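__global__ void kernel_func(const unsigned *A, float *S, unsigned n)
{
    const unsigned idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= n) return; // n = number of indexes in A; surplus threads do nothing

    const unsigned S_idx = A[idx]; // indirection: position in the large array

    S[S_idx] *= 5; // for example...
    ...
}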

However, the accesses to S[S_idx] are scattered, so they mostly cannot be coalesced and will be slow; this is the most likely bottleneck.

If you decide to use CUDA, you will need to experiment a lot with block and grid sizes, minimize register use per thread (to maximize the number of blocks resident per multiprocessor), and perhaps sort A so that neighboring threads access nearby S_idx values.
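As a rough illustration of the last two suggestions, here is a minimal host-side sketch that sorts the index array with Thrust and derives the grid size from the number of indexes. The names scale_selected and run_indirect_scale, the factor parameter, and the default block size of 256 are my own assumptions for the example, not anything from the question. It also assumes that the indexes are unique and in range, and it copies the data to and from the GPU only to stay self-contained; with a GPU-resident array you would skip those copies.

```cuda
#include <cuda_runtime.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <vector>

// Same indirection as the kernel above, with a bounds check and a
// caller-supplied factor (the factor parameter is an illustrative addition).
__global__ void scale_selected(const unsigned *A, float *S, unsigned n, float factor)
{
    const unsigned idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= n) return;   // surplus threads in the last block do nothing
    S[A[idx]] *= factor;    // one element of the large array per thread
}

// Hypothetical host helper: uploads the data, sorts the indexes, launches
// one thread per index, and copies the result back into `data`.
// Assumes every index is unique and smaller than data.size().
void run_indirect_scale(std::vector<float> &data,
                        const std::vector<unsigned> &indices,
                        float factor, unsigned block_size = 256)
{
    if (indices.empty()) return;

    thrust::device_vector<float> d_S(data.begin(), data.end());
    thrust::device_vector<unsigned> d_A(indices.begin(), indices.end());

    // After sorting, neighboring threads read neighboring (or at least
    // nearby) addresses, which gives the hardware a chance to coalesce.
    thrust::sort(d_A.begin(), d_A.end());

    const unsigned n = static_cast<unsigned>(d_A.size());
    const unsigned grid_size = (n + block_size - 1) / block_size; // round up

    scale_selected<<<grid_size, block_size>>>(
        thrust::raw_pointer_cast(d_A.data()),
        thrust::raw_pointer_cast(d_S.data()), n, factor);
    cudaDeviceSynchronize();

    thrust::copy(d_S.begin(), d_S.end(), data.begin());
}
```

Whether the sort pays for itself depends on how often A is reused and how scattered the indexes are, so it is worth timing both variants with a few different block_size values.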
