The Archive Base


Editorial Team
Asked: May 13, 2026

How are threads organized to be executed by a GPU?


1 Answer

  1. Editorial Team
    Added an answer on May 13, 2026 at 10:51 pm

    Hardware

    If a GPU device has, for example, 4 multiprocessing units that can each run 768 threads, then at any given moment no more than 4*768 = 3072 threads will really be running in parallel (if you launch more threads, the rest wait their turn).
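The two numbers in this example (4 multiprocessors, 768 threads each) are illustrative. On a real device they can be read from the CUDA runtime. A minimal sketch, assuming the device of interest is device 0; the helper names `maxResidentThreads` and `printDeviceLimits` are made up for this illustration:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Upper bound on the threads resident on the whole device at one time.
// Hypothetical helper: plain arithmetic, needs no GPU.
inline int maxResidentThreads(int multiprocessors, int threadsPerMultiprocessor)
{
    return multiprocessors * threadsPerMultiprocessor;
}

// Prints the real limits of one device. Returns false if the query fails.
inline bool printDeviceLimits(int device = 0)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return false;
    }
    std::printf("multiprocessors:                %d\n", prop.multiProcessorCount);
    std::printf("max threads per multiprocessor: %d\n", prop.maxThreadsPerMultiProcessor);
    std::printf("max resident threads:           %d\n",
                maxResidentThreads(prop.multiProcessorCount,
                                   prop.maxThreadsPerMultiProcessor));
    return true;
}
```

For the device of the example, `maxResidentThreads(4, 768)` gives the 3072 threads mentioned above.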

    Software

    Threads are organized in blocks. A block is executed by a multiprocessing unit.
    The threads of a block can be identified (indexed) using 1-dimensional (x), 2-dimensional (x,y) or 3-dimensional (x,y,z) indexes, but in any case x*y*z <= 768 for our example (other restrictions apply to x, y and z; see the CUDA programming guide and your device's compute capability).
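Whether a given block shape is legal can be checked against the limits the runtime reports. The sketch below is only an illustration: `blockShapeFits` and `blockShapeFitsDevice` are hypothetical helper names, and it assumes the device of interest is device 0. Note that real devices report the per-block limit (`maxThreadsPerBlock`) separately from the per-multiprocessor limit, whereas the example in the text uses 768 for both.

```cuda
#include <cuda_runtime.h>

// Pure check: does a block of shape (x, y, z) respect the given limits?
// maxPerDim holds the limit for each dimension, maxThreads the limit on x*y*z.
inline bool blockShapeFits(dim3 block, const int maxPerDim[3], int maxThreads)
{
    if (block.x == 0 || block.y == 0 || block.z == 0) {
        return false;
    }
    if (block.x > static_cast<unsigned int>(maxPerDim[0]) ||
        block.y > static_cast<unsigned int>(maxPerDim[1]) ||
        block.z > static_cast<unsigned int>(maxPerDim[2])) {
        return false;
    }
    unsigned long long total =
        static_cast<unsigned long long>(block.x) * block.y * block.z;
    return total <= static_cast<unsigned long long>(maxThreads);
}

// Same check against the limits of a real device (false if the query fails).
inline bool blockShapeFitsDevice(dim3 block, int device = 0)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return false;
    }
    return blockShapeFits(block, prop.maxThreadsDim, prop.maxThreadsPerBlock);
}
```

With the 768-thread limit of the example, an 8 x 8 block (64 threads) fits, while a 32 x 32 block (1024 threads) does not.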

    Obviously, if you need more than those 4*768 threads you need more than 4 blocks.
    Blocks may also be indexed 1D, 2D or 3D. There is a queue of blocks waiting to enter the GPU (because, in our simplified example, the GPU has 4 multiprocessors and only 4 blocks are being executed simultaneously).

    Now a simple case: processing a 512×512 image

    Suppose we want one thread to process one pixel (i,j).

    We can use blocks of 64 threads each. Then we need 512*512/64 = 4096 blocks
    (so as to have 512×512 threads = 4096*64).

    To make indexing the image easier, it is common to organize the threads in 2D blocks with blockDim = 8 x 8 (the 64 threads per block). I prefer to call it threadsPerBlock.

    dim3 threadsPerBlock(8, 8);  // 64 threads
    

    and 2D gridDim = 64 x 64 blocks (the 4096 blocks needed). I prefer to call it numBlocks.

    // assumes imageWidth and imageHeight are multiples of the block dimensions
    dim3 numBlocks(imageWidth / threadsPerBlock.x,   /* for instance 512/8 = 64 */
                   imageHeight / threadsPerBlock.y);
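The division above is exact for a 512×512 image, but integer division truncates when the image size is not a multiple of the block size, which would leave the last rows or columns without threads. A common way around this is to round the division up. A minimal sketch; `ceilDiv` and `gridFor` are hypothetical helper names, not part of CUDA:

```cuda
#include <cuda_runtime.h>

// Integer division rounded up (hypothetical helper).
inline unsigned int ceilDiv(unsigned int a, unsigned int b)
{
    return (a + b - 1) / b;
}

// Grid that covers a width x height image with the given 2D block shape.
inline dim3 gridFor(unsigned int width, unsigned int height, dim3 threadsPerBlock)
{
    return dim3(ceilDiv(width, threadsPerBlock.x),
                ceilDiv(height, threadsPerBlock.y));
}
```

For a 512×512 image and 8 x 8 blocks this still gives a 64 x 64 grid. For a 500×300 image it gives 63 x 38 blocks, so some threads in the last blocks fall outside the image and the kernel has to skip them with a bounds check.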
    

    The kernel is launched like this:

    myKernel<<<numBlocks, threadsPerBlock>>>( /* params for the kernel function */ );
    

    Finally: there will be something like “a queue of 4096 blocks”, where each block waits to be assigned to one of the multiprocessors of the GPU to get its 64 threads executed.
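The queue picture is a simplification: a multiprocessor can usually hold several small blocks at the same time, as long as their threads and resources fit. The CUDA runtime can report how many through `cudaOccupancyMaxActiveBlocksPerMultiprocessor`. The sketch below is an illustration under assumptions: `placeholderKernel` stands in for the real kernel, and `roundsNeeded` and `residentBlocksOnDevice` are hypothetical helper names. Blocks are not really scheduled in lock-step rounds, so the round count is only a rough way to think about the queue.

```cuda
#include <cuda_runtime.h>

// Empty kernel standing in for the real one (an assumption for this sketch).
__global__ void placeholderKernel() {}

// Rough number of "rounds" if concurrentBlocks blocks run at a time.
inline int roundsNeeded(int totalBlocks, int concurrentBlocks)
{
    return (totalBlocks + concurrentBlocks - 1) / concurrentBlocks;
}

// Blocks of placeholderKernel that can be resident on the current device
// at once, or -1 if a runtime query fails.
inline int residentBlocksOnDevice(int threadsPerBlock)
{
    int device = 0;
    if (cudaGetDevice(&device) != cudaSuccess) {
        return -1;
    }
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;
    }
    int blocksPerMultiprocessor = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerMultiprocessor, placeholderKernel,
            threadsPerBlock, 0 /* dynamic shared memory in bytes */) != cudaSuccess) {
        return -1;
    }
    return blocksPerMultiprocessor * prop.multiProcessorCount;
}
```

In the simplified model of the text (4 blocks at a time), `roundsNeeded(4096, 4)` is 1024. On a real device, `residentBlocksOnDevice(64)` would typically report many more than 4 concurrent blocks.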

    In the kernel the pixel (i,j) to be processed by a thread is calculated this way:

    unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;  // pixel index along x
    unsigned int j = (blockIdx.y * blockDim.y) + threadIdx.y;  // pixel index along y
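Putting the pieces together, a complete kernel might look like the sketch below. It is not the kernel of the original question: it assumes an 8-bit grayscale image stored row by row, and the names `pixelOffset`, `invertKernel` and `invertOnGpu` are invented for the illustration. The bounds check lets it work even when the image size is not a multiple of the block size.

```cuda
#include <cuda_runtime.h>

// Offset of pixel (i, j) in a row-major image that is `width` pixels wide.
__host__ __device__ inline unsigned int pixelOffset(unsigned int i, unsigned int j,
                                                    unsigned int width)
{
    return j * width + i;
}

// One thread per pixel: inverts an 8-bit grayscale image in place.
__global__ void invertKernel(unsigned char* image, unsigned int width, unsigned int height)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < width && j < height) {   // skip threads that fall outside the image
        unsigned int k = pixelOffset(i, j, width);
        image[k] = static_cast<unsigned char>(255 - image[k]);
    }
}

// Host wrapper: copies the image to the GPU, runs the kernel, copies it back.
inline cudaError_t invertOnGpu(unsigned char* hostImage, unsigned int width,
                               unsigned int height)
{
    unsigned char* devImage = nullptr;
    size_t bytes = static_cast<size_t>(width) * height;
    cudaError_t err = cudaMalloc(&devImage, bytes);
    if (err != cudaSuccess) {
        return err;
    }
    err = cudaMemcpy(devImage, hostImage, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dim3 threadsPerBlock(8, 8);  // 64 threads per block, as in the text
        dim3 numBlocks((width + threadsPerBlock.x - 1) / threadsPerBlock.x,
                       (height + threadsPerBlock.y - 1) / threadsPerBlock.y);
        invertKernel<<<numBlocks, threadsPerBlock>>>(devImage, width, height);
        err = cudaGetLastError();    // reports launch errors
    }
    if (err == cudaSuccess) {
        // This copy also waits for the kernel to finish.
        err = cudaMemcpy(hostImage, devImage, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(devImage);
    return err;
}
```

Calling `invertOnGpu(pixels, 512, 512)` on a host buffer of 512*512 bytes launches the same 64 x 64 grid of 8 x 8 blocks described above.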
    

© 2021 The Archive Base. All Rights Reserved