I am using CUDA to do calculations on a potentially large 3D data set. I think it is best to see a short code snippet first:
void launch_kernel(/*arguments . . . */){
    int bx = xend-xstart, by = yend-ystart, bz = zend-zstart;
    dim3 blocks(/*dimensions*/);
    dim3 threads(/*dimensions*/);
    kernel<<<blocks, threads>>>();
}
I have a 3D set of cells and I need to launch a kernel to compute each one. The problem is that the input size may exceed the capabilities of the GPU, specifically the threads. So code like this:
void launch_kernel(/*arguments . . . */){
    int bx = xend-xstart, by = yend-ystart, bz = zend-zstart;
    dim3 blocks(bx,by,1);
    dim3 threads(bz);
    kernel<<<blocks, threads>>>();
}
… doesn’t work well. What if the dimensions are 1000x1000x1000? I can’t launch 1000 threads per block. Or even better, what if the dimensions are 5x5x1000? Now I am launching hardly any blocks, yet the kernel would have to be launched as 5x5x512 because of the hardware limit, and each thread would do 2 calculations. I also can’t just mash my dimensions together, putting some of the z values in the blocks and some in the threads, because I need to be able to identify the x, y and z coordinates in the kernel. Currently:
__global__ void kernel(/*arguments*/){
    int x = xstart + blockIdx.x;
    int y = ystart + blockIdx.y;
    int z = zstart + threadIdx.x;
    if(x < xend && y < yend && z < zend){
        //calculate
    }
}
I need a solid, efficient way to figure out these variables:
the block x and y dimensions, the thread x dimension (and y? and z?), how to recover x, y, z inside the kernel from blockIdx and threadIdx, and, if the input exceeds the hardware limits, the size of the “step” I take in each dimension in a for loop inside my kernel calculation.
If you have any questions, please ask. This is a difficult problem, and it has been troubling me (especially since the number of blocks/threads I launch is a major component of performance). The code needs to make these decisions automatically for different data sets, and I am not sure how to do that efficiently. Thank you in advance.
I think you are vastly overcomplicating things here. The basic problem seems to be that you need to run a kernel on a 1000 x 1000 x 1000 computational domain. So you require 1000000000 threads, which is well within the capabilities of all CUDA-compatible hardware. So just use a standard 2D CUDA execution grid with at least the number of threads needed to do the computation (if you don’t understand how to do that, leave a comment and I will add it to the answer), and then inside your kernel call a little setup function something like this:
[disclaimer: written in browser, never compiled, never run, never tested. Use at own risk].
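A minimal sketch of what such a setup function could look like. This is a reconstruction, not the answerer's original code: the names `Coord3`, `logical_coords` and `flat_thread_id`, the x-fastest ordering, and the 1D thread block are all assumptions. The `HD` macro only exists so the index arithmetic can also be compiled and checked as plain C++ on the host.

```cpp
#ifdef __CUDACC__
#define HD __host__ __device__
#else
// Plain C++ build: the qualifiers vanish so the math can be tested on the host.
#define HD
#endif

// Hypothetical helper type holding one logical coordinate.
struct Coord3 { int x, y, z; };

// Map a flat thread index onto (x, y, z) in a dimx * dimy * dimz domain,
// with x varying fastest. Returns false for padding threads that fall
// past the end of the domain.
HD inline bool logical_coords(unsigned long long tid,
                              int dimx, int dimy, int dimz, Coord3 &c)
{
    unsigned long long total = (unsigned long long)dimx * dimy * dimz;
    if (tid >= total) return false;
    c.x = (int)(tid % dimx);
    c.y = (int)((tid / dimx) % dimy);
    c.z = (int)(tid / ((unsigned long long)dimx * dimy));
    return true;
}

#ifdef __CUDACC__
// Flat index of the calling thread in a 2D grid of 1D blocks.
// 64-bit arithmetic avoids overflow once the grid is padded past 2^31 threads.
__device__ inline unsigned long long flat_thread_id()
{
    unsigned long long block =
        (unsigned long long)blockIdx.y * gridDim.x + blockIdx.x;
    return block * blockDim.x + threadIdx.x;
}
#endif
```

With this layout, consecutive threads get consecutive x values, which tends to suit memory coalescing if x is also the fastest-varying index in the data.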
This function will return “logical” thread coordinates in the 3D domain (dimx,dimy,dimz) from a CUDA 2D execution grid. Call it at the beginning of the kernel something like this:
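A rough sketch of the launch side and of the kernel prologue, again an illustration rather than the original answer's code. The helper `grid_2d_for`, the `out` buffer, the parameter lists, the 256 threads per block, and the 65535 per-dimension grid limit are assumptions; the real limits can be read from `maxGridSize` via `cudaGetDeviceProperties`. The coordinate recovery is written inline so the block stands alone.

```cpp
// Hypothetical helper: pick a 2D grid of 1D blocks that holds at least
// `total` threads. 65535 is a conservative per-dimension grid limit.
inline void grid_2d_for(unsigned long long total, unsigned threads_per_block,
                        unsigned &grid_x, unsigned &grid_y,
                        unsigned max_grid_x = 65535u)
{
    // Number of blocks needed, rounded up.
    unsigned long long nblocks =
        (total + threads_per_block - 1) / threads_per_block;
    grid_x = (unsigned)(nblocks < max_grid_x ? nblocks : max_grid_x);
    if (grid_x == 0) grid_x = 1;
    // Enough rows of grid_x blocks to cover nblocks, rounded up.
    grid_y = (unsigned)((nblocks + grid_x - 1) / grid_x);
    if (grid_y == 0) grid_y = 1;
}

#ifdef __CUDACC__
__global__ void kernel(float *out, int xstart, int ystart, int zstart,
                       int dimx, int dimy, int dimz)
{
    // Flat thread index in a 2D grid of 1D blocks.
    unsigned long long tid =
        ((unsigned long long)blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x
        + threadIdx.x;
    if (tid >= (unsigned long long)dimx * dimy * dimz) return;  // padding thread

    // Logical 3D coordinates, x varying fastest.
    int x = xstart + (int)(tid % dimx);
    int y = ystart + (int)((tid / dimx) % dimy);
    int z = zstart + (int)(tid / ((unsigned long long)dimx * dimy));

    out[tid] = (float)(x + y + z);  // placeholder for the per-cell calculation
}

void launch_kernel(float *out, int xstart, int xend,
                   int ystart, int yend, int zstart, int zend)
{
    int dimx = xend - xstart, dimy = yend - ystart, dimz = zend - zstart;
    const unsigned tpb = 256;  // assumed block size
    unsigned gx, gy;
    grid_2d_for((unsigned long long)dimx * dimy * dimz, tpb, gx, gy);
    kernel<<<dim3(gx, gy), dim3(tpb)>>>(out, xstart, ystart, zstart,
                                        dimx, dimy, dimz);
}
#endif
```

The block size no longer depends on any of the domain dimensions, so a 5x5x1000 domain and a 1000x1000x1000 domain go through the same code path; only the grid size changes.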
Note that there is a lot of integer computational overhead in getting that grid set up, so you might want to think about why you really need a 3D grid. You would be surprised at the number of times it isn’t actually necessary, and much of that setup overhead can be avoided.
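As one illustration of avoiding the 3D setup entirely: if the per-cell work only needs the cell's position in a flat array, a grid-stride loop (a different technique from the setup function described above) needs no division or modulo, and it also handles inputs larger than the launched grid. The names `process_strided` and `kernel_flat` and the doubling operation are hypothetical placeholders; the `HD` macro lets the loop be exercised on the host.

```cpp
#include <cstddef>

#ifdef __CUDACC__
#define HD __host__ __device__
#else
// Plain C++ build: the qualifiers vanish so the loop can be tested on the host.
#define HD
#endif

// Visit elements start, start + stride, start + 2 * stride, ... of a flat
// array of n cells. Doubling each value stands in for the real calculation.
HD inline void process_strided(float *cells, std::size_t n,
                               std::size_t start, std::size_t stride)
{
    for (std::size_t i = start; i < n; i += stride)
        cells[i] *= 2.0f;
}

#ifdef __CUDACC__
// Each thread starts at its own global index and steps by the total number
// of threads in the (1D) grid, so any grid size covers any n.
__global__ void kernel_flat(float *cells, std::size_t n)
{
    std::size_t start  = (std::size_t)blockIdx.x * blockDim.x + threadIdx.x;
    std::size_t stride = (std::size_t)gridDim.x * blockDim.x;
    process_strided(cells, n, start, stride);
}
#endif
```

Here the “step” is simply the total number of launched threads, so the launch configuration can be tuned for the device without touching the kernel.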