I’m using PTX from MATLAB to call CUDA kernels. When testing the code in VS 2010, I launch the kernel like this:
int TPB = 256;
int BPG = (Nx + TPB -1 ) / TPB;
dim3 blk(TPB,TPB,1);
dim3 grid(BPG ,BPG,1);
grad<<< grid,blk>>>(dev_y,dev_x,dev_b,dev_t,Nx,Ny);
I then tried to use the same configuration in MATLAB:
TPB = 16;
BPG = floor((Nx + TPB -1 ) / TPB);
grad = parallel.gpu.CUDAKernel('reg.ptx','reg.cu','grad');
grad.ThreadBlockSize=[TPB TPB 1];
grad.GridSize = [BPG BPG];
I knew this was more than the 512 threads per block allowed on my Tesla C1060, and I was right:
Invalid size input to kernel ThreadBlockSize. You must provide a vector of up to 3 positive integers whose product is <= 512. The maximum value in each dimension is: [512,512,64].
Is there any explanation for why it runs without error in VS 2010, unlike in MATLAB?
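The C++ code segment is not checking for errors after the grad<<<>>> launch, whereas the MATLAB wrapper performs additional error checking. The launch configuration is out of bounds (256 x 256 = 65536 threads per block, far above the limit of 512), so the launch fails and the kernel never actually runs. Calling cudaGetLastError() right after the <<<>>> launch will report the launch configuration error.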
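As a rough sketch of that check, something along these lines should surface the failure on the C++ side. The kernel body, its shortened argument list, and the array sizes are placeholders assumed for illustration, not the original grad code; only the launch configuration is taken from the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the poster's grad; the body and the
// reduced argument list are hypothetical.
__global__ void grad(float *y, const float *x, int Nx, int Ny)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < Nx && j < Ny)
        y[j * Nx + i] = x[j * Nx + i];
}

int main()
{
    const int Nx = 1024, Ny = 1024;   // assumed sizes, not from the question
    float *dev_x = NULL, *dev_y = NULL;
    cudaMalloc(&dev_x, Nx * Ny * sizeof(float));
    cudaMalloc(&dev_y, Nx * Ny * sizeof(float));

    int TPB = 256;
    int BPG = (Nx + TPB - 1) / TPB;
    dim3 blk(TPB, TPB, 1);            // 256 * 256 = 65536 threads per block
    dim3 grid(BPG, BPG, 1);

    grad<<<grid, blk>>>(dev_y, dev_x, Nx, Ny);

    // A kernel launch returns no status itself, so ask the runtime explicitly.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("kernel launch failed: %s\n", cudaGetErrorString(err));

    // Errors raised while the kernel executes only show up after a sync.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("kernel execution failed: %s\n", cudaGetErrorString(err));

    cudaFree(dev_x);
    cudaFree(dev_y);
    return 0;
}
```

With this block size the first check should fire and report an invalid launch configuration. Dropping TPB to 16 gives 256 threads per block, which is within the limit quoted in the MATLAB error message.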