How are threads organized to be executed by a GPU?
Share
Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.
Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.
Lost your password? Please enter your email address. You will receive a link and will create a new password via email.
Please briefly explain why you feel this question should be reported.
Please briefly explain why you feel this answer should be reported.
Please briefly explain why you feel this user should be reported.
Hardware
If a GPU device has, for example, 4 multiprocessing units, and they can run 768 threads each: then at a given moment no more than 4*768 threads will be really running in parallel (if you planned more threads, they will be waiting their turn).
Software
threads are organized in blocks. A block is executed by a multiprocessing unit.
The threads of a block can be indentified (indexed) using 1Dimension(x), 2Dimensions (x,y) or 3Dim indexes (x,y,z) but in any case xyz <= 768 for our example (other restrictions apply to x,y,z, see the guide and your device capability).
Obviously, if you need more than those 4*768 threads you need more than 4 blocks.
Blocks may be also indexed 1D, 2D or 3D. There is a queue of blocks waiting to enter
the GPU (because, in our example, the GPU has 4 multiprocessors and only 4 blocks are
being executed simultaneously).
Now a simple case: processing a 512×512 image
Suppose we want one thread to process one pixel (i,j).
We can use blocks of 64 threads each. Then we need 512*512/64 = 4096 blocks
(so to have 512×512 threads = 4096*64)
It’s common to organize (to make indexing the image easier) the threads in 2D blocks having blockDim = 8 x 8 (the 64 threads per block). I prefer to call it threadsPerBlock.
and 2D gridDim = 64 x 64 blocks (the 4096 blocks needed). I prefer to call it numBlocks.
The kernel is launched like this:
Finally: there will be something like “a queue of 4096 blocks”, where a block is waiting to be assigned one of the multiprocessors of the GPU to get its 64 threads executed.
In the kernel the pixel (i,j) to be processed by a thread is calculated this way: