I am implementing a complicated algorithm on CUDA. But there is a really odd

Question

Asked: June 17, 2026 (2026-06-17T20:02:39+00:00)

I am implementing a complicated algorithm in CUDA, but I have run into a really odd problem. To summarise: the kernel repeats a series of calculations many times, and each iteration depends on the result of the previous one. I use an array in global memory to pass information between blocks in each iteration. For example, with 2 blocks, in each iteration block 0 saves its result to global memory and block 1 then reads it back from global memory. The problem is that block 1 does not see the up-to-date array: it sometimes gets the result of the 1st iteration rather than that of the previous one.

a_e and e_a are two arrays in global memory, each of size [2*8].
d_a_e and d_e_a are in shared memory, each of size [blockDim.x+1][8].

if(threadIdx.x<8)
{
     //block 0 writes, block 1 reads, this can't work properly
     a_e[blockIdx.x*8+threadIdx.x]=d_a_e[blockDim.x][threadIdx.x];
     if(blockIdx.x>0)
          d_a_e[0][threadIdx.x]=a_e[(blockIdx.x-1)*8+threadIdx.x];

     //block 1 writes, block 0 reads, this can work properly
     e_a[blockIdx.x*8+threadIdx.x]=d_e_a[0][threadIdx.x];
     if(blockIdx.x < gridDim.x-1)
          d_e_a[blockDim.x][threadIdx.x]=e_a[(blockIdx.x+1)*8+threadIdx.x];
}

1 Answer

Editorial Team · Answer 1 · 2026-06-17T20:02:40+00:00

This setup won’t work: you’re effectively trying to serialize your blocks, which, as talonmies alluded to in his comment, CUDA does not support. From the CUDA programming guide:

“Thread blocks are required to execute independently: It must be possible to execute them in any order, in parallel or in series. This independence requirement allows thread blocks to be scheduled in any order across any number of cores…”
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy

Your best recourse is probably to launch separate kernels (such that you perform the block 0 computation in the 1st kernel, the block 1 computation in the 2nd kernel, etc.), so that the results from the 1st kernel are complete before the next kernel reads them. There has been some work on inter-block synchronization, but you wouldn’t derive much benefit from it, as you need to serialize your blocks anyway.
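As a rough sketch of the separate-kernel idea: kernels issued to the same stream run in order, so everything one launch wrote to global memory is complete before the next launch starts. The variant below uses one launch per iteration with two buffers, rather than one launch per block, and assumes each iteration of your algorithm can be expressed as a single launch. The names `step`, `run_iterations` and `HALO`, and the placeholder update inside the kernel, are hypothetical and not taken from your code.

```cuda
#include <cuda_runtime.h>
#include <utility>

constexpr int HALO = 8;  // eight exchanged values per block, as in the question

// One iteration. Each block reads its left neighbour's values from `prev`,
// which the previous launch has completely written, and stores its own
// values in `next`. No block ever reads data written during the same launch.
__global__ void step(const float* prev, float* next)
{
    if (threadIdx.x < HALO) {
        int i = blockIdx.x * HALO + threadIdx.x;
        float left = (blockIdx.x > 0) ? prev[i - HALO] : 0.0f;
        next[i] = prev[i] + left;  // placeholder for the real computation
    }
}

// Hypothetical host driver: one launch per iteration, swapping the buffers
// in between. `host_inout` holds blocks * HALO floats; `threads` must be at
// least HALO so that every element of `next` is written in each launch.
cudaError_t run_iterations(int blocks, int threads, int iterations,
                           float* host_inout)
{
    size_t bytes = static_cast<size_t>(blocks) * HALO * sizeof(float);
    float* a = nullptr;
    float* b = nullptr;

    cudaError_t err = cudaMalloc(&a, bytes);
    if (err != cudaSuccess) return err;
    err = cudaMalloc(&b, bytes);
    if (err != cudaSuccess) { cudaFree(a); return err; }

    err = cudaMemcpy(a, host_inout, bytes, cudaMemcpyHostToDevice);
    for (int it = 0; err == cudaSuccess && it < iterations; ++it) {
        step<<<blocks, threads>>>(a, b);  // reads a, writes b
        err = cudaGetLastError();         // catch launch failures
        std::swap(a, b);                  // newest results are now in a
    }
    if (err == cudaSuccess) {
        // Blocking copy: waits for all preceding launches to finish.
        err = cudaMemcpy(host_inout, a, bytes, cudaMemcpyDeviceToHost);
    }

    cudaFree(a);
    cudaFree(b);
    return err;
}
```

The double buffering matters here: if each launch read and wrote the same array, a block could still pick up a neighbour's value from the current iteration instead of the previous one, which is the same race in a different place.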

EDIT: I should also point out that the block scheduling isn’t documented, and is liable to change at any point, so any inter-block synchronization will be non-portable and liable to break on a driver or CUDA toolkit update.
