Here is the kernel that I am launching for calculating some array in parallel.

Question

0

Asked: June 9, 20262026-06-09T10:12:42+00:00 2026-06-09T10:12:42+00:00

Here is the kernel that I am launching for calculating some array in parallel.

0

Here is the kernel that I am launching for calculating some array in parallel.

 __device__ bool mult(int colsize,int rowsize,int *Aj,int *Bi,int *val)
    {       
        for(int j = 0; j < rowsize;j++)
        {           
           for(int k = 0;k < colsize;k++)
            {   
              if(Aj[j] == Bi[k])
               {    
                return true;
                }                               
            }           
        }
            return false;       
    }


__global__ void kernel(int *Aptr,int *Aj,int *Bptr,int *Bi,int rows,int cols,int *Cjc)
    {
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        int i;
        if(tid < cols)
        {
            int beg = Bptr[tid];
            int end = Bptr[tid+1];
            for(i = 0;i < rows;i++)
            {
                int cbeg = Aptr[i];
                int cend = Aptr[i+1];
                if(mult(end - beg,cend - cbeg,Aj+cbeg,Bi+beg))
                {                                                
                     Cjc[tid+1] += 1;
                     //atomicAdd(Cjc+tid+1,1);           
                }
            }                
        }               
    }

And here is how I decide the configuration of grid and blocks

int numBlocks,numThreads;

        if(q % 32 == 0)
        {
            numBlocks = q/32;
            numThreads = 32;
        }
        else
        {
            numBlocks = (q+31)/32;
            numThreads = 32;
        }
findkernel<<<numBlocks,numThreads>>>(devAptr,devAcol,devBjc,devBir,m,q,d_Cjc);

I am using GTX 480 with CC 2.0.
Now the problem that I am facing is that whenever q increases beyond 4096 the values in Cjc array are all produced as 0.
I know maximum number of blocks that I can use in X direction is 65535 and each block can have at most (1024,1024,64) threads. Then why does this kernel calculate the wrong output for Cjc array?

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-09T10:12:44+00:00

OK so I finally figured out using cudaError_t that when I tried to cudaMemcpy the d_Cjc array from device to host, it throws following error.

CUDA error: the launch timed out and was terminated

It turns out that some of the calculations in findkernel are taking reasonably large amount of time which causes the display driver to terminate the program because of OS ‘watchdog’ time limit.

I believe I will have to shut down X server or ssh my gpu machine (from another machine) by removing its display.This will buy me some time to do the calculations that will not exceed the ‘watchdog’ limit of OS.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

Here is the kernel that I am launching for calculating some array in parallel.

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply