I have created a simple CUDA application to add two matrices. It compiles fine. I want to know how the kernel is executed by all the threads and what the flow inside CUDA is. In other words, in what fashion does each thread process the elements of the matrices?
I know this is a very basic concept, but I don't understand it. I am confused about the flow.
You launch a grid of blocks.
Each block is assigned as a whole to one multiprocessor; a block is never split across multiprocessors. The number of blocks resident on a multiprocessor determines how much shared memory is available to each of them.
Blocks are further split into warps. On a Fermi GPU a warp is 32 threads that either execute the same instruction or are inactive (because they branched away, e.g. by exiting a loop earlier than their neighbors within the same warp, or by not taking an if branch that the neighbors did take). On a Fermi GPU at most two warps run on one multiprocessor at a time.
Whenever there is latency (that is, execution stalls while waiting for a memory access or a data dependency to complete), another warp is run. The number of warps that fit onto one multiprocessor, from the same or from different blocks, is determined by the number of registers used by each thread and the amount of shared memory used by each block.
This scheduling happens transparently. That is, you do not have to think about it too much.
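If you are curious how many warps actually fit onto one multiprocessor for a given kernel, the runtime can tell you. The following is only a sketch: the kernel scale and the helper residentWarpsPerSM are made-up names for illustration, and it assumes a toolkit recent enough to provide cudaOccupancyMaxActiveBlocksPerMultiprocessor (which did not exist in the Fermi era).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, only here so the runtime has something to analyse.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical helper: number of warps of `scale` that can be resident on
// one multiprocessor of device 0 for the given block size, or -1 on error.
int residentWarpsPerSM(int blockSize)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1;

    // The runtime accounts for the registers and shared memory the kernel uses.
    int blocksPerSM = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, scale, blockSize, 0) != cudaSuccess) return -1;

    int warpsPerBlock = (blockSize + prop.warpSize - 1) / prop.warpSize;
    return blocksPerSM * warpsPerBlock;
}

int main()
{
    int warps = residentWarpsPerSM(256);
    printf("resident warps per multiprocessor: %d\n", warps);
    return warps > 0 ? 0 : 1;
}
```

The result depends on the device and the compiler, so treat it as a diagnostic rather than a number to rely on.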
However, you might want to use the predefined integer vectors threadIdx (where is my thread within the block?), blockDim (how large is one block?), blockIdx (where is my block in the grid?) and gridDim (how large is the grid?) to split up work (read: input and output) among the threads. You might also want to read up on how to access the different types of memory effectively (so that multiple threads can be serviced within a single transaction), but that's leading off topic.
Nsight provides a graphical debugger that gives you a good idea of what's happening on the device once you get through the jargon jungle. The same goes for its profiler regarding those things you won't see in the debugger (e.g. stall reasons or memory pressure).
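To make the index vectors concrete for the matrix addition in the question, here is a minimal sketch in which every thread computes at most one output element. I don't know what your kernel looks like, so the names matrixAdd and runMatrixAdd, the row-major layout, the 16x16 block size and the use of managed memory (which needs a newer GPU and toolkit than Fermi) are all assumptions on my part.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes at most one element of c = a + b.
__global__ void matrixAdd(const float *a, const float *b, float *c,
                          int rows, int cols)
{
    // Global coordinates of this thread within the whole grid.
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    // The grid is rounded up to whole blocks, so surplus threads do nothing.
    if (row < rows && col < cols) {
        int i = row * cols + col;  // row-major linear index (assumed layout)
        c[i] = a[i] + b[i];
    }
}

// Hypothetical host helper: true if the GPU result matches the CPU sum.
bool runMatrixAdd(int rows, int cols)
{
    size_t n = (size_t)rows * cols, bytes = n * sizeof(float);
    float *a, *b, *c;
    // Managed memory keeps the sketch short; explicit cudaMemcpy works too.
    if (cudaMallocManaged(&a, bytes) != cudaSuccess ||
        cudaMallocManaged(&b, bytes) != cudaSuccess ||
        cudaMallocManaged(&c, bytes) != cudaSuccess) return false;
    for (size_t i = 0; i < n; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

    dim3 block(16, 16);  // 256 threads per block, i.e. 8 warps
    dim3 grid((cols + block.x - 1) / block.x,
              (rows + block.y - 1) / block.y);
    matrixAdd<<<grid, block>>>(a, b, c, rows, cols);
    bool ok = cudaDeviceSynchronize() == cudaSuccess;

    for (size_t i = 0; ok && i < n; ++i) ok = (c[i] == a[i] + b[i]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return ok;
}

int main()
{
    bool ok = runMatrixAdd(100, 70);
    printf("%s\n", ok ? "ok" : "mismatch");
    return ok ? 0 : 1;
}
```

Note that neither the kernel nor the host code says anything about the order in which blocks or warps run. Because threadIdx.x varies fastest within a warp, neighboring threads read neighboring addresses here, which is the kind of access pattern the memory system can service in few transactions.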
You can synchronize all threads within the grid (all there are) by another kernel launch.
For non-overlapping, sequential kernel execution no further synchronization is needed.
The threads within one grid (or one kernel run, whatever you want to call it) can communicate via global memory using atomic operations (for arithmetic) or appropriate memory fences (for load or store access).
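A small sketch of both grid-level mechanisms: the first kernel lets all threads of the grid accumulate into one counter in global memory with atomicAdd, and the second kernel, launched afterwards on the same (default) stream, can rely on that counter being complete. The kernel names, the helper runTwoKernels and the use of managed memory are my own assumptions for illustration, and one atomic per thread is chosen for clarity, not speed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every thread of the grid adds its element to one counter in global memory.
__global__ void sumAll(const int *in, int n, int *total)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(total, in[i]);
}

// Launched after sumAll on the same stream, so the total is already final.
__global__ void subtractMean(int *data, int n, const int *total)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] -= *total / n;
}

// Hypothetical host helper: true if both kernels produced the expected data.
bool runTwoKernels(int n)
{
    int *data, *total;
    if (cudaMallocManaged(&data, n * sizeof(int)) != cudaSuccess ||
        cudaMallocManaged(&total, sizeof(int)) != cudaSuccess) return false;
    for (int i = 0; i < n; ++i) data[i] = i;
    *total = 0;

    int block = 128, grid = (n + block - 1) / block;
    sumAll<<<grid, block>>>(data, n, total);
    // The kernel boundary is the grid-wide synchronization point.
    subtractMean<<<grid, block>>>(data, n, total);
    bool ok = cudaDeviceSynchronize() == cudaSuccess;

    ok = ok && (*total == n * (n - 1) / 2);
    int mean = *total / n;  // integer division, as in the kernel
    for (int i = 0; ok && i < n; ++i) ok = (data[i] == i - mean);
    cudaFree(data); cudaFree(total);
    return ok;
}

int main()
{
    bool ok = runTwoKernels(1000);
    printf("%s\n", ok ? "ok" : "mismatch");
    return ok ? 0 : 1;
}
```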
You can synchronize all threads within one block with the intrinsic instruction __syncthreads() (all threads will be active afterwards, although, as always, at most two warps can run at a time on a Fermi GPU). The threads within one block can communicate via shared or global memory using atomic operations (for arithmetic) or appropriate memory fences (for load or store access).
As mentioned earlier, all threads within a warp are always "synchronized", although some might be inactive. They can communicate through shared or global memory (or "lane swapping" on upcoming hardware with compute capability 3). You can use atomic operations (for arithmetic) and volatile-qualified shared or global variables (load or store accesses happen sequentially within the same warp). The volatile qualifier tells the compiler to always access memory and never to keep the value in registers, whose state cannot be seen by other threads.
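Block-level synchronization is easiest to see in a shared-memory reduction. The sketch below sums the elements handled by each block; the names blockSum and runBlockSum, the fixed block size of 256 (it must be a power of two for this loop) and the managed memory are assumptions I made for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 256  // threads per block; must be a power of two here

// Each block writes the sum of its BLOCK input elements to blockTotals.
__global__ void blockSum(const int *in, int n, int *blockTotals)
{
    __shared__ int partial[BLOCK];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    partial[tid] = (i < n) ? in[i] : 0;
    __syncthreads();  // every thread's value is now visible to the block

    // Tree reduction: the number of working threads halves in each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();  // reached by all threads, including the idle ones
    }
    if (tid == 0) blockTotals[blockIdx.x] = partial[0];
}

// Hypothetical host helper: sum of 0..n-1 computed on the GPU, -1 on error.
int runBlockSum(int n)
{
    int blocks = (n + BLOCK - 1) / BLOCK;
    int *in, *blockTotals;
    if (cudaMallocManaged(&in, n * sizeof(int)) != cudaSuccess ||
        cudaMallocManaged(&blockTotals, blocks * sizeof(int)) != cudaSuccess)
        return -1;
    for (int i = 0; i < n; ++i) in[i] = i;

    blockSum<<<blocks, BLOCK>>>(in, n, blockTotals);
    if (cudaDeviceSynchronize() != cudaSuccess) return -1;

    int total = 0;  // the few per-block results are added up on the host
    for (int b = 0; b < blocks; ++b) total += blockTotals[b];
    cudaFree(in); cudaFree(blockTotals);
    return total;
}

int main()
{
    printf("%d\n", runBlockSum(1000));
    return 0;
}
```

The __syncthreads() calls sit outside the if on purpose: every thread of the block has to reach the barrier, otherwise the behavior is undefined.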
Further, there are warp-wide vote functions that can help you make branch decisions or compute integer (prefix) sums.
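As an illustration of the vote functions, the following sketch computes, for one warp, how many lanes hold a positive value and the exclusive prefix sum of those flags. It assumes a current toolkit, where the vote intrinsics carry a _sync suffix and an explicit lane mask (on Fermi-era toolkits the equivalent was plain __ballot); warpRank and runWarpVote are made-up names.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Launched with exactly one warp (32 threads) in this sketch.
__global__ void warpRank(const int *in, int *rank, int *count)
{
    int lane = threadIdx.x % 32;
    bool keep = in[threadIdx.x] > 0;

    // Bit k of votes is set if lane k of the warp voted true.
    unsigned votes = __ballot_sync(0xffffffffu, keep);

    // Exclusive prefix sum of the flags: count the set bits below my lane.
    rank[threadIdx.x] = __popc(votes & ((1u << lane) - 1u));
    if (lane == 0) *count = __popc(votes);
}

// Hypothetical host helper: number of kept lanes, or -1 on any mismatch.
int runWarpVote()
{
    int *in, *rank, *count;
    if (cudaMallocManaged(&in, 32 * sizeof(int)) != cudaSuccess ||
        cudaMallocManaged(&rank, 32 * sizeof(int)) != cudaSuccess ||
        cudaMallocManaged(&count, sizeof(int)) != cudaSuccess) return -1;
    for (int i = 0; i < 32; ++i) in[i] = (i % 3 == 0) ? 1 : 0;

    warpRank<<<1, 32>>>(in, rank, count);
    if (cudaDeviceSynchronize() != cudaSuccess) return -1;

    int result = *count, seen = 0;
    for (int i = 0; i < 32; ++i) {
        if (rank[i] != seen) result = -1;  // compare with a serial prefix sum
        seen += in[i];
    }
    cudaFree(in); cudaFree(rank); cudaFree(count);
    return result;
}

int main()
{
    printf("%d\n", runWarpVote());
    return 0;
}
```

With this input, lanes 0, 3, 6, ..., 30 vote true, so the count is 11; the rank tells each of those lanes where it would write in a compacted output.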
OK, that’s basically it. Hope that helps. Had a good flow writing :-).