I tried a method that avoids the Host-to-Device data transfer. Normally, we assign values to the elements of the Host array in a loop and then transfer the array to the Device. This works fine for me with 1D and 2D arrays. The new method I tried is to assign the values to the array elements inside the kernel. This worked for 1D arrays, but for a 2D array the result is 0. My device can support (512,512) threads per block. The output values are fine up to Length=22, but I get ‘0’ for Length=23 [22<sqrt(512)<23]. From [22<sqrt(512)<23], it looks as though only 22x22 threads are being used. What is the problem? Why is this happening?
The Code:
const int Length=23;
Main Function:
int A[Length],B[Length],C[Length],D[Length],*Ad,*Bd;
int size=Length*sizeof(int);
cudaMalloc((void**)&Ad,size);
cudaMalloc((void**)&Bd,size);
dim3 dimGrid(1,1);
dim3 dimBlock(Length,Length);
FuncG<<<dimGrid,dimBlock>>>(Ad,Bd);
cudaMemcpy(C,Ad,size,cudaMemcpyDeviceToHost);
cudaMemcpy(D,Bd,size,cudaMemcpyDeviceToHost);
for(int i=0;i<Length;i++){
    printf("%d %d\n",C[i],D[i]);
}
return 0;
Kernel Function:
__global__ void FuncG(int *Ad,int *Bd){
    int tx=threadIdx.x;
    int ty=threadIdx.y;
    Ad[tx]=tx;
    Bd[ty]=ty;
}
Your device can only support 512 threads per block in total. Each of the first two thread block dimensions can be as large as 512, but the product of all the block dimensions must not exceed 512. A 22×22 block (484 threads) is a legal block size, but a 23×23 block (529 threads) is not.
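To see both limits on your own hardware, you can query the device properties and compare them against the block you intend to launch. The following is only a sketch: blockFits is a hypothetical helper name, and device 0 is assumed to be the GPU in question. The fields maxThreadsPerBlock and maxThreadsDim come from the CUDA runtime's cudaDeviceProp structure.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the CUDA API): returns true if `block`
// respects both the per-dimension limits and the total threads-per-block
// limit reported in `prop`.
bool blockFits(dim3 block, const cudaDeviceProp &prop) {
    if (block.x > (unsigned int)prop.maxThreadsDim[0]) return false;
    if (block.y > (unsigned int)prop.maxThreadsDim[1]) return false;
    if (block.z > (unsigned int)prop.maxThreadsDim[2]) return false;
    unsigned long long total = (unsigned long long)block.x * block.y * block.z;
    return total <= (unsigned long long)prop.maxThreadsPerBlock;
}

int main() {
    cudaDeviceProp prop;
    // Assumption: device 0 is the GPU the kernel will run on.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dimensions:  (%d, %d, %d)\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);

    // Compare the two block shapes from the question against the limits.
    for (int len = 22; len <= 23; ++len) {
        dim3 block(len, len);
        printf("%dx%d block (%d threads): %s\n", len, len, len * len,
               blockFits(block, prop) ? "fits" : "too large");
    }
    return 0;
}
```

The result depends on the device: with a 512-thread limit the 23×23 block is reported as too large, while a device with a higher limit would accept it.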
You are getting 0 output because the kernel is never running. If you check for it, you will find the kernel launch is failing with an invalid execution configuration error. The canonical way to check for a launch failure of this kind is something like:
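A minimal sketch of such a check, reusing the kernel from the question. The helper reportCudaError is a hypothetical name introduced here for illustration; cudaGetLastError, cudaDeviceSynchronize and cudaGetErrorString are standard CUDA runtime calls.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

const int Length = 23;

__global__ void FuncG(int *Ad, int *Bd) {
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    Ad[tx] = tx;
    Bd[ty] = ty;
}

// Hypothetical helper: prints a message and returns true if `err`
// signals a failure, returns false on cudaSuccess.
bool reportCudaError(cudaError_t err, const char *what) {
    if (err == cudaSuccess) return false;
    fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
    return true;
}

int main() {
    int *Ad = NULL, *Bd = NULL;
    size_t size = Length * sizeof(int);
    if (reportCudaError(cudaMalloc((void **)&Ad, size), "cudaMalloc Ad")) return 1;
    if (reportCudaError(cudaMalloc((void **)&Bd, size), "cudaMalloc Bd")) return 1;

    dim3 dimGrid(1, 1);
    dim3 dimBlock(Length, Length);  // 529 threads: too many for a 512-thread limit
    FuncG<<<dimGrid, dimBlock>>>(Ad, Bd);

    // A launch with an illegal configuration is reported here,
    // before the kernel ever runs.
    bool failed = reportCudaError(cudaGetLastError(), "kernel launch");

    // Errors raised while the kernel executes only surface after a sync.
    if (!failed) {
        failed = reportCudaError(cudaDeviceSynchronize(), "kernel execution");
    }

    cudaFree(Ad);
    cudaFree(Bd);
    return failed ? 1 : 0;
}
```

On a device limited to 512 threads per block, the launch check should report the error; on a device with a larger limit, the same 23×23 launch is legal and no message is printed.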