I have a kernel that, for each thread in a given block, computes a

Question

Asked: May 28, 2026 (2026-05-28T01:29:59+00:00)

I have a kernel in which every thread of a given block runs a for loop, and the number of iterations differs from block to block. I use a buffer of size N_BLOCKS to store the number of iterations required for each block. Hence, each thread in a given block must know the number of iterations specific to its block.

However, I’m not sure which way is best, performance-wise, to read the value and distribute it to all the other threads in the block. I see only one good way (please tell me if there is something better): have one thread store the value in shared memory, then have every thread read it. For example:

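__global__ void foo( int* nIterBuf )
{
   // One copy of the iteration count per block
   __shared__ int nIter;

   // Thread 0 loads the count for this block from global memory
   if( threadIdx.x == 0 )
      nIter = nIterBuf[blockIdx.x];

   // Make the value visible to every thread before anyone reads it
   __syncthreads();

   for( int i = 0; i < nIter; i++ )
      ...
}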

Are there any better solutions? My app will process a lot of data, so I want the best performance.

Thanks!

1 Answer

Editorial Team · Answer 1 · 2026-05-28T01:30:00+00:00

Read-only values that are uniform across all threads in a block are probably best stored in __constant__ arrays. On some CUDA architectures such as Fermi (SM 2.x), if you declare the array or pointer argument using the C++ const keyword AND you access it uniformly within the block (i.e. the index only depends on blockIdx, not threadIdx), then the compiler may automatically promote the reference to constant memory.

The advantage of constant memory is that it goes through a dedicated cache, so it doesn’t pollute the L1. If the amount of data you are accessing per block is relatively small, then after the initial compulsory miss in each thread block, every subsequent access should hit in the cache.
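A minimal sketch of the __constant__ approach, under some assumptions: the names MAX_BLOCKS, nIterConst, fooConstant and runFooConstant are made up for illustration, the loop body is a placeholder because the original one is not shown, and the output buffer exists only so the result can be checked. Keep in mind that a __constant__ array needs a compile-time size and that total constant memory is limited to 64 KB.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical upper bound on the number of blocks (not from the question).
constexpr int MAX_BLOCKS = 1024;

// Per-block iteration counts, stored in constant memory.
__constant__ int nIterConst[MAX_BLOCKS];

// Every thread reads its block's count directly. The index depends only on
// blockIdx, so the access is uniform across the block.
__global__ void fooConstant(int* out)
{
    const int nIter = nIterConst[blockIdx.x];

    int acc = 0;
    for (int i = 0; i < nIter; i++)
        acc += 1;  // placeholder loop body (assumption)

    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

// Hypothetical host helper: uploads the counts, launches one block per
// count, and returns what each thread computed. Error checking is omitted
// to keep the sketch short.
std::vector<int> runFooConstant(const std::vector<int>& nIter, int threadsPerBlock)
{
    const int nBlocks = static_cast<int>(nIter.size());
    assert(nBlocks > 0 && nBlocks <= MAX_BLOCKS);

    // Copy the counts from host memory into the __constant__ array.
    cudaMemcpyToSymbol(nIterConst, nIter.data(), nBlocks * sizeof(int));

    const size_t count = static_cast<size_t>(nBlocks) * threadsPerBlock;
    int* dOut = nullptr;
    cudaMalloc(&dOut, count * sizeof(int));

    fooConstant<<<nBlocks, threadsPerBlock>>>(dOut);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    std::vector<int> out(count);
    cudaMemcpy(out.data(), dOut, count * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    return out;
}
```

With this helper, a call such as runFooConstant({3, 5}, 4) launches two blocks of four threads; every thread in the first block runs 3 iterations and every thread in the second runs 5.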

You also won’t need any shared memory, so the copy from global to shared memory and the __syncthreads() that goes with it both go away.
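If you would rather keep the counts in an ordinary global buffer, the const-pointer variant could look roughly like this. It is only a sketch: fooUniformRead and runFooUniform are hypothetical names, the loop body is a placeholder, and whether the load actually goes through the constant cache depends on the architecture and the compiler, so it is worth profiling on your own hardware.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// The buffer is passed as a pointer to const, and the index depends only on
// blockIdx, so every thread in the block reads the same element.
// No shared memory and no __syncthreads() are involved.
__global__ void fooUniformRead(const int* nIterBuf, int* out)
{
    const int nIter = nIterBuf[blockIdx.x];

    int acc = 0;
    for (int i = 0; i < nIter; i++)
        acc += 1;  // placeholder loop body (assumption)

    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

// Hypothetical host helper: copies the counts to the device, launches one
// block per count, and returns what each thread computed. Error checking is
// omitted to keep the sketch short.
std::vector<int> runFooUniform(const std::vector<int>& nIter, int threadsPerBlock)
{
    const int nBlocks = static_cast<int>(nIter.size());
    assert(nBlocks > 0);

    int* dIter = nullptr;
    cudaMalloc(&dIter, nBlocks * sizeof(int));
    cudaMemcpy(dIter, nIter.data(), nBlocks * sizeof(int), cudaMemcpyHostToDevice);

    const size_t count = static_cast<size_t>(nBlocks) * threadsPerBlock;
    int* dOut = nullptr;
    cudaMalloc(&dOut, count * sizeof(int));

    fooUniformRead<<<nBlocks, threadsPerBlock>>>(dIter, dOut);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    std::vector<int> out(count);
    cudaMemcpy(out.data(), dOut, count * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dIter);
    cudaFree(dOut);
    return out;
}
```

The kernel differs from the shared-memory version in the question only in that each thread loads nIterBuf[blockIdx.x] itself, which removes the barrier.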
