I want to port my C code to CUDA. The main computational part contains three nested for loops:
for (int i = 0; i < Nx; i++) {
    for (int j = 0; j < Ncontains[i]; j++) {
        for (int k = 0; k < totalVoxels; k++) {
            .......
        }
    }
}
How can I translate that to my CUDA kernel? With two for loops I could do something like:
int n = blockIdx.y * blockDim.y + threadIdx.y;
int i= blockIdx.x * blockDim.x + threadIdx.x;
But how can I get this running with three loops?
There are many ways to do it. One of them is:
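A minimal sketch of one such mapping, under a few assumptions: each thread handles one (i, j) pair from a 2D grid, the innermost k loop stays sequential inside the thread, and `Ncontains` has already been copied to device memory. The kernel name `nestedLoopsKernel` is hypothetical, and the loop body is left as a comment because the question does not show it.

```cuda
#include <cuda_runtime.h>

// Sketch only: one thread per (i, j) pair; k is looped over inside the thread.
// Ncontains is assumed to be a device pointer holding Nx entries.
__global__ void nestedLoopsKernel(const int *Ncontains, int Nx, int totalVoxels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // outer loop index
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // middle loop index

    // The grid is rounded up to whole blocks, and each row i has only
    // Ncontains[i] valid j values, so surplus threads exit early.
    // Checking i first avoids reading Ncontains out of bounds.
    if (i >= Nx || j >= Ncontains[i]) return;

    for (int k = 0; k < totalVoxels; k++) {
        // body of the original innermost loop goes here
    }
}
```

Because `Ncontains[i]` varies with i, some threads in the j direction do no work. If the counts differ a lot, flattening the (i, j) pairs into a single index may balance the load better, but that depends on the data.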
You would call the above like this:
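A hedged sketch of the launch configuration for a 2D mapping of i and j onto threads. It assumes `maxNcontains` is the largest entry of `Ncontains`, computed on the host beforehand; the helper names `divUp` and `makeGrid`, the kernel name `nestedLoopsKernel`, and the device pointer `d_Ncontains` are all hypothetical.

```cuda
#include <cuda_runtime.h>

// Round-up integer division: how many blocks are needed to cover n items.
static unsigned int divUp(int n, unsigned int blockSize)
{
    return (n + blockSize - 1) / blockSize;
}

// Grid that covers Nx values of i along x and maxNcontains values of j along y.
dim3 makeGrid(int Nx, int maxNcontains, dim3 block)
{
    return dim3(divUp(Nx, block.x), divUp(maxNcontains, block.y));
}

// Hypothetical launch, assuming a kernel that takes these arguments:
//   dim3 block(16, 16);                          // 256 threads per block
//   dim3 grid = makeGrid(Nx, maxNcontains, block);
//   nestedLoopsKernel<<<grid, block>>>(d_Ncontains, Nx, totalVoxels);
//   cudaDeviceSynchronize();                     // wait for the kernel to finish
```

The 16 x 16 block size is just a common starting point, not a tuned value. Since the grid is rounded up to whole blocks, the kernel still needs its own bounds check on i and j.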