The Archive Base

Editorial Team
Asked: June 18, 2026

I am a CUDA beginner. So far I learned that each SM has 8 blocks


I am a CUDA beginner.

So far I have learned that each SM has 8 blocks (of threads). Let’s say I have a simple job: multiplying each element of an array by 2. However, I have less data than threads.

That is not a problem in itself, because I could cut off the “tail” of threads and leave them idle. But if I understand correctly, this would mean some SMs get 100% of the work and others only part of it (or even none).

So I would like to work out which SM is running a given thread and arrange the computation so that each SM has an equal amount of work.

I hope this makes sense in the first place 🙂 If so, how do I compute which SM is running a given thread? Or, put differently, how do I get the index of the current SM and the total number of SMs? In other words, the equivalent of blockDim/threadIdx in SM terms.

Update

This was too long for a comment.

Robert, thank you for your answer. While I try to digest it all, here is what I am doing: I have a “big” array and I simply have to multiply each value by 2 and store it in an output array (as a warm-up; by the way, all the computations I do are mathematically correct). First I ran this with 1 block and 1 thread. Fine. Next, I tried to split the work so that each multiplication is done exactly once, by one thread. As a result, my program runs around 6 times slower. I even have a sense of why: there is a small penalty for fetching the GPU info, then for computing how many blocks and threads I should use, and then within each thread, instead of a single multiplication, I now have around 10 extra multiplications just to compute that thread’s offset into the array. On the one hand I am trying to find out how to change that undesired behaviour; on the other, I would like to spread the “tail” of threads evenly among the SMs.

Let me rephrase. Maybe I am mistaken, but I would like to solve this. I have 1G small jobs (*2, that’s all). Should I create 1K blocks with 1K threads, 1M blocks with 1 thread, 1 block with 1M threads, and so on? So far, I read the GPU properties, divide, divide again, and blindly use the maximum values for each dimension of the grid/block (or the required value, if there is less data to compute).

The code

size is the size of the input and output array. In general:

output_array[i] = input_array[i]*2;

Computing how many blocks/threads I need.

size_t total_threads = props.maxThreadsPerMultiProcessor
                       * props.multiProcessorCount;
if (size<total_threads)
    total_threads = size;

size_t total_blocks = 1+(total_threads-1)/props.maxThreadsPerBlock;

size_t threads_per_block = 1+(total_threads-1)/total_blocks;  

Given props.maxGridSize and props.maxThreadsDim, I compute the grid and block dimensions in a similar manner, starting from total_blocks and threads_per_block.

And then the killer part, computing the offset for a thread (“inside” the thread):

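// Flatten the 3D thread index and 2D block index into one linear thread index.
size_t offset = threadIdx.x;   // x is the fastest-varying dimension
size_t dim = blockDim.x;
offset += threadIdx.y*dim;
dim *= blockDim.y;
offset += threadIdx.z*dim;
dim *= blockDim.z;             // dim == threads per block
offset += blockIdx.x*dim;
dim *= gridDim.x;
offset += blockIdx.y*dim;
dim *= gridDim.y;              // dim == total threads in the grid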

size_t chunk = 1+(size-1)/dim;

So now I have the starting offset for the current thread and the amount of data in the array (chunk) it has to multiply. I didn’t use gridDim.z above because, AFAIK, it is always 1, right?

1 Answer

  1. Editorial Team
    Added an answer on June 18, 2026 at 3:32 pm

    It’s an unusual thing to try to do. Given that you are a CUDA beginner, such a question seems to me to be indicative of attempting to solve a problem improperly. What is the problem you are trying to solve? How does it help your problem if you are executing a particular thread on SM X vs. SM Y? If you want maximum performance out of the machine, structure your work in a way such that all thread processors and SMs can be active, and in fact that there is “more than enough work” for all. GPUs depend on oversubscribed resources to hide latency.

    As a CUDA beginner, your goals should be:

    • create enough work, both in blocks and threads
    • access memory efficiently (this mostly has to do with coalescing – you can read up on that)

    There is no benefit to making sure that “each SM has an equal amount of work”. If you create enough blocks in your grid, each SM will have an approximately equal amount of work. This is the scheduler’s job, and you should let the scheduler do it. If you do not create enough blocks, your first objective should be to create or find more work to do, not to come up with a fancy per-block work breakdown that will yield no benefit.
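If you only want to observe how the scheduler spreads blocks, a rough sketch like the one below may help. It assumes the PTX special register `%smid`, read through inline assembly. The names `current_sm_id`, `record_sm` and `blocks_per_sm` are made up for this illustration. Treat the result as a diagnostic only, not as something to build a work breakdown on.

```cuda
#include <cuda_runtime.h>
#include <map>
#include <vector>

// Reads the PTX special register %smid: the ID of the SM executing this thread.
__device__ unsigned int current_sm_id()
{
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// Thread 0 of each block records the SM its block was scheduled on.
// sm_of_block must point to gridDim.x unsigned ints in device memory.
__global__ void record_sm(unsigned int* sm_of_block)
{
    if (threadIdx.x == 0)
        sm_of_block[blockIdx.x] = current_sm_id();
}

// Host-side tally (hypothetical helper): SM ID -> number of blocks it received.
// A map is used because SM IDs are not guaranteed to be contiguous.
inline std::map<unsigned int, int>
blocks_per_sm(const std::vector<unsigned int>& sm_of_block)
{
    std::map<unsigned int, int> tally;
    for (unsigned int id : sm_of_block)
        ++tally[id];
    return tally;
}
```

To use it, launch `record_sm` on a 1D grid, copy the array back with `cudaMemcpy`, and pass it to `blocks_per_sm`. The total number of SMs is the `multiProcessorCount` field of the device properties, the same `props.multiProcessorCount` the question already reads.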

    Each SM in the Fermi GPU (for example) has 32 thread processors. To keep these processors busy despite inevitable machine stalls due to memory accesses and the like, the machine is designed to hide latency by swapping in another warp (32 threads) when a stall occurs, so that processing can continue. For this to work, you should try to have a large number of available warps per SM. You get this by having:

    • many threadblocks in your grid (at least 6 times the number of SMs in the GPU)
    • multiple warps per threadblock (probably at least 4 to 8 warps, so 128 to 256 threads per block)

    Since a (Fermi) SM is always executing 32 threads at a time, if I have fewer threads than 32 times the number of SMs in my GPU at any instant, then my machine is under-utilized. If my entire problem is only composed of, say, 20 threads, then it’s simply not well designed to take advantage of any GPU, and breaking those 20 threads up into multiple SMs/threadblocks is not likely to have any appreciable benefit.
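The arithmetic behind these rules of thumb can be written down as a small host-side check. This is only a sketch: the helper names `threads_per_instant` and `enough_work` are invented here, and the thresholds are the ones quoted above (32 threads per Fermi SM, at least 6 blocks per SM, at least 4 whole warps per block).

```cuda
#include <cstddef>

// Threads a Fermi-class GPU executes at one instant: 32 per SM (per the text).
inline size_t threads_per_instant(int sm_count)
{
    return static_cast<size_t>(32) * sm_count;
}

// Rule-of-thumb check (hypothetical helper): at least 6 blocks per SM and
// at least 4 whole warps (128 threads) per block.
inline bool enough_work(size_t blocks, unsigned int threads_per_block,
                        int sm_count)
{
    bool whole_warps   = threads_per_block % 32 == 0;
    bool enough_warps  = threads_per_block >= 4 * 32;
    bool enough_blocks = blocks >= static_cast<size_t>(6) * sm_count;
    return whole_warps && enough_warps && enough_blocks;
}
```

For an assumed device with 14 SMs, `threads_per_instant(14)` is 448, and the rules ask for at least 84 blocks of at least 128 threads, which is 10,752 threads in total. A 20-thread problem fails every one of these checks.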

    EDIT: Since you don’t want to post your code, I’ll make a few more suggestions or comments.

    1. You tried to modify some code, found that it runs slower, then jumped to (I think) the wrong conclusion.
    2. You should probably familiarize yourself with a simple code example like vector add. It’s not multiplying each element, but the structure is close. There’s no way performing this vector add using a single thread would actually run faster. I think if you study this example, you’ll find a straightforward way to extend it to do array element multiply-by-2.
    3. Nobody computes threads per block the way you have outlined. First of all, threads per block should be a multiple of 32. Secondly, it’s customary to pick threads per block as a starting point, and build your other launch parameters from it, not the other way around. For a large problem, just start with 256 or 512 threads per block, and dispense with the calculations for that.
    4. Build your other launch parameters (grid size) based on your chosen threadblock size. Your problem is 1D in nature, so a 1D grid of 1D threadblocks is a good starting point. If this calculation exceeds the machine limit in terms of max blocks in x-dimension, then you can either have each thread loop to process multiple elements or else extend to a 2D grid (of 1D threadblocks).
    5. Your offset calculation is needlessly complex. Refer to the vector add example about how to create a grid of threads with relatively simple offset calculation to process an array.
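Points 3 to 5 might look roughly like this for the multiply-by-2 case. It is a sketch in the style of the vector add sample, not the sample itself. The `float` element type, the names `double_elements`, `blocks_for` and `launch_double`, and the 65535 block cap (the x-dimension grid limit on older devices) are assumptions made for illustration.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Each thread starts at its global index and strides by the total thread
// count, so a capped grid still covers every element (grid-stride loop).
__global__ void double_elements(const float* in, float* out, size_t n)
{
    size_t i      = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    size_t stride = static_cast<size_t>(gridDim.x) * blockDim.x;
    for (; i < n; i += stride)
        out[i] = in[i] * 2.0f;
}

// Hypothetical helper: blocks needed for n elements, rounded up,
// at least 1 and at most max_blocks.
inline unsigned int blocks_for(size_t n, unsigned int threads_per_block,
                               unsigned int max_blocks)
{
    size_t needed = (n + threads_per_block - 1) / threads_per_block;
    if (needed == 0)
        needed = 1;
    return needed < max_blocks ? static_cast<unsigned int>(needed) : max_blocks;
}

// Hypothetical host wrapper; d_in and d_out are device pointers of n floats.
inline cudaError_t launch_double(const float* d_in, float* d_out, size_t n)
{
    const unsigned int threads = 256;   // chosen first, a multiple of 32
    const unsigned int blocks  = blocks_for(n, threads, 65535u);
    double_elements<<<blocks, threads>>>(d_in, d_out, n);
    return cudaGetLastError();
}
```

The per-thread offset is one multiply and one add. The loop only repeats when the array is larger than the grid, which is the "each thread loops over multiple elements" option from point 4.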