What is the relationship between a CUDA core, a streaming multiprocessor and the CUDA model of blocks and threads?
What gets mapped to what and what is parallelized and how? and what is more efficient, maximize the number of blocks or the number of threads?
My current understanding is that there are 8 cuda cores per multiprocessor. and that every cuda core will be able to execute one cuda block at a time. and all the threads in that block are executed serially in that particular core.
Is this correct?
The thread / block layout is described in detail in the CUDA programming guide. In particular, chapter 4 states:
Each SM contains 8 CUDA cores, and at any one time they’re executing a single warp of 32 threads – so it takes 4 clock cycles to issue a single instruction for the whole warp. You can assume that threads in any given warp execute in lock-step, but to synchronise across warps, you need to use
__syncthreads().