I have a CUDA program that calls the kernel repeatedly within a for loop.

Asked: May 27, 2026

I have a CUDA program that calls the kernel repeatedly within a for loop. The
code computes each row of a matrix from the values computed in the previous row
until the entire matrix is done. This is basically a dynamic programming algorithm.
Each kernel call in the code below fills the (i,j) entry of many separate matrices
in parallel.

for(i = 1; i <=xdim; i++){

  for(j = 1; j <= ydim; j++){ 

    start3time = clock();
    assign5<<<BLOCKS, THREADS>>>(Z, i, j, x, y, z);
    end3time = clock(); 
    diff = static_cast<double>(end3time-start3time)/(CLOCKS_PER_SEC / 1000); 
    printf("Time for i=%d j=%d is %f\n", i, j, diff); 
  }

}

The kernel assign5 is straightforward

__global__ void assign5(float* Z, int i, int j, int x, int y, int z) {

  int id = threadIdx.x + blockIdx.x * blockDim.x;

  char ch = database[j + id];

  Z[i+id] = (Z[x+id] + Z[y+id] + Z[z+id])*dev_matrix[i][index[ch - 'A']];

}

My problem is that when I run this program the time for each i and j is 0 most of the
time but sometimes it is 10 milliseconds. So the output looks like

Time for i=0 j=0 is 0
Time for i=0 j=1 is 0
.
.
Time for i=15 j=21 is 10
Time for i=15 j=22 is 0
.

I don’t understand why this is happening. I don’t see a thread race condition. If I add

if(i % 20 == 0) cudaThreadSynchronize();

right after the first loop then the Time for i and j is mostly 0. But then the time
for sync is sometimes 10 or even 20. It seems like CUDA is performing many operations
at low cost and then charges a lot for later ones. Any help would be appreciated.

1 Answer

Editorial Team · Answer 1 · 2026-05-27T22:35:56+00:00

I think you have a misconception about what a kernel call in CUDA actually does on the host. A kernel launch is non-blocking: it only adds the kernel to the device's queue and returns to the host right away. If you read the clock before and after the launch, the difference says nothing about how long the kernel takes to run; it only measures the time needed to add the launch to the queue.

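You should add a cudaThreadSynchronize() after every kernel call and before you measure end3time. cudaThreadSynchronize() blocks the host and returns only once all kernels in the queue have finished their work. (In current CUDA releases cudaThreadSynchronize() is deprecated; cudaDeviceSynchronize() is its replacement.)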
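The difference between the two readings can be seen in a minimal sketch. It assumes a hypothetical stand-in kernel, busyKernel, in place of assign5, uses std::chrono instead of clock(), and calls cudaDeviceSynchronize(), the current name for the synchronization call. The names LaunchTiming and timeOneLaunch are illustrative only and do not come from the question.

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical stand-in for assign5: each thread does a fixed amount of arithmetic.
__global__ void busyKernel(float* data, int n, int iters) {
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    if (id >= n) return;
    float v = data[id];
    for (int k = 0; k < iters; ++k) v = v * 1.0001f + 0.5f;
    data[id] = v;
}

struct LaunchTiming {
    double launchOnlyMs;    // host time until the launch call returned
    double launchAndRunMs;  // host time until the kernel had finished
};

// Launches the kernel once and reads the host clock twice:
// right after the launch returns, and again after synchronizing.
LaunchTiming timeOneLaunch(float* devData, int n, int iters) {
    using Clock = std::chrono::steady_clock;
    using Ms = std::chrono::duration<double, std::milli>;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaDeviceSynchronize();   // start from an empty queue
    auto t0 = Clock::now();
    busyKernel<<<blocks, threads>>>(devData, n, iters);
    auto t1 = Clock::now();    // launch has returned; kernel may still be running
    cudaDeviceSynchronize();   // wait until the kernel has finished
    auto t2 = Clock::now();

    return { Ms(t1 - t0).count(), Ms(t2 - t0).count() };
}
```

Called from main with a buffer obtained from cudaMalloc, launchOnlyMs corresponds to what the loop in the question measures, while launchAndRunMs includes the kernel's run time.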

This is why

if(i % 20 == 0) cudaThreadSynchronize();

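caused spikes in your measurements: that call waits for every kernel queued since the previous synchronization, so their accumulated run time is charged to a single reading.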
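A different technique from the one described above, but one that avoids host-clock effects altogether, is to time the work with CUDA events, which are recorded on the device. The following is only a sketch under assumptions: scaleKernel is a hypothetical placeholder for assign5, and timeLaunchesWithEvents is an illustrative helper name, not part of the question's code.

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder kernel standing in for assign5.
__global__ void scaleKernel(float* data, int n, float factor) {
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    if (id < n) data[id] *= factor;
}

// Returns the device time in milliseconds taken by `launches` back-to-back
// kernel launches, or a negative value if an event call fails.
float timeLaunchesWithEvents(float* devData, int n, int launches) {
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start);       // queued in the default stream before the kernels
    for (int k = 0; k < launches; ++k)
        scaleKernel<<<blocks, threads>>>(devData, n, 1.0f);
    cudaEventRecord(stop);        // queued after the last kernel
    cudaEventSynchronize(stop);   // block the host until `stop` has been reached

    float ms = -1.0f;
    if (cudaEventElapsedTime(&ms, start, stop) != cudaSuccess) ms = -1.0f;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

Because both events sit in the same queue as the kernels, the elapsed time covers the kernels' execution on the device rather than the time the host spent enqueueing them.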
