I have something I don’t understand about CUDA. I understand that there are ‘virtual’ variables called threads.
When programming the kernel, the thread variables are set automatically, and the kernel runs over and over again, once for every thread. (Correct so far?)
Now if I program something like this:
for (int i = 0; i < 100; i++) {
....
}
Is it run over and over again for every thread? Or just once?
All code you put in a kernel function (one prefixed by __global__) is executed by every thread of the launch; the number of threads is specified at kernel launch time. So a for loop in the kernel body runs in full, once per thread. In the body of the kernel, you can differentiate the work of each thread according to its global or local identifier. For the 1D case (probably yours):
local identifier:
int tid = threadIdx.x;
global identifier:
int tid = blockIdx.x * blockDim.x + threadIdx.x;
Further clarifications:
Suppose you have a kernel named dummy and you want to launch 4096 concurrent threads. You should organize the threads into thread blocks (to exploit locality and to respect hardware limits such as the maximum number of threads per block). If you break the 4096 threads into thread blocks of 256 threads each, you get 4096 / 256 = 16 blocks, so you launch the dummy function with an execution configuration of 16 blocks of 256 threads, that is, <<<16, 256>>>.
The dummy function will then be executed 4096 times, once per thread, serially or in parallel depending on the hardware (the actual GPU). You should assume that all threads run in parallel. You can differentiate the work of each thread using its global thread identifier, as described above.
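
As a minimal sketch of the answer, the code below puts the loop from the question inside a kernel and launches it with 16 blocks of 256 threads. Only the kernel name dummy comes from the answer; the kernel body, the iterations_per_thread parameter, and the run_dummy host helper are assumptions made for illustration, not the original code. Each thread computes its global identifier, runs the loop, and records how many iterations it performed.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: the name `dummy` follows the answer, but the body and
// the `iterations_per_thread` parameter are assumptions for illustration.
__global__ void dummy(int *iterations_per_thread)
{
    // Global identifier: unique across the whole 1D launch.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // The loop from the question. Every thread executes all 100 iterations.
    int count = 0;
    for (int i = 0; i < 100; i++) {
        count++;  // stands in for the elided loop body
    }

    // Each thread writes its own result to its own slot.
    iterations_per_thread[tid] = count;
}

// Hypothetical host helper: launches 4096 threads as 16 blocks of 256 threads
// and returns the per-thread iteration counts (empty vector on CUDA failure).
std::vector<int> run_dummy()
{
    const int threads_per_block = 256;
    const int num_blocks = 16;  // 4096 / 256
    const int total_threads = num_blocks * threads_per_block;

    int *device_counts = nullptr;
    if (cudaMalloc(&device_counts, total_threads * sizeof(int)) != cudaSuccess) {
        return {};
    }

    // Execution configuration: <<<number of blocks, threads per block>>>.
    dummy<<<num_blocks, threads_per_block>>>(device_counts);

    // cudaMemcpy waits for the kernel above to finish before copying.
    std::vector<int> host_counts(total_threads, 0);
    cudaError_t status = cudaMemcpy(host_counts.data(), device_counts,
                                    total_threads * sizeof(int),
                                    cudaMemcpyDeviceToHost);
    cudaFree(device_counts);
    if (status != cudaSuccess) {
        return {};
    }
    return host_counts;
}
```

Calling run_dummy() from main on a machine with a CUDA-capable GPU should give back 4096 entries, each equal to 100: the loop is not shared between threads, every thread runs it in full.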