Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8059329
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 5, 20262026-06-05T09:38:29+00:00 2026-06-05T09:38:29+00:00

There has been much discussion about how to choose the #blocks & blockSize, but

  • 0

There has been much discussion about how to choose the #blocks & blockSize, but I still missing something. Many of my concerns address this question: How CUDA Blocks/Warps/Threads map onto CUDA Cores? (To simplify the discussion, there is enough perThread & perBlock memory. Memory limits are not an issue here.)

kernelA<<<nBlocks, nThreads>>>(varA,constB, nThreadsTotal);

1) To keep the SM as busy as possible, I should set nThreads to a multiple of warpSize. True?

2) An SM can only execute one kernel at a time. That is all HWcores of that SM are executing only kernelA. (Not some HWcores running kernelA, while others run kernelB.) So if I have only one thread to run, I’m “wasting” the other HWcores. True?

3)If the warp-scheduler issues work in units of warpSize (32 threads), and each SM has 32 HWcores, then the SM would be full utilized. What happens when the SM has 48 HWcores? How can I keep all 48 cores full utilized when the scheduler is issuing work in chunks of 32? (If the previous paragraph is true, wouldn’t it be better if the scheduler issued work in units of HWcore size?)

4) It looks like the warp-scheduler queues up 2 tasks at a time. So that when the currently-executing kernel stalls or blocks, the 2nd kernel is swapped in. (It is not clear, but I’ll guess the queue here is more than 2 kernels deep.) Is this correct?

5) If my HW has an upper limit of 512 threads-per-block (nThreadsMax), that doesn’t mean the kernel with 512 threads will run fastest on one block. (Again, mem not an issue.) There is a good chance I’ll get better performance if I spread the 512-thread kernel across many blocks, not just one. The block is executed on one or many SM’s. True?

5a) I’m thinking the smaller the better, but does it matter how small I make nBlocks? The question is, how to choose the value of nBlocks that is decent? (Not necessarily optimal.) Is there a mathematical approach to choosing nBlocks, or is it simply trial-n-err.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-05T09:38:31+00:00Added an answer on June 5, 2026 at 9:38 am

    Let me try to answer your questions one by one.

    1. That is correct.
    2. What do you mean exactly by “HWcores”? The first part of your statement is correct.
    3. According to the NVIDIA Fermi Compute Architecture Whitepaper: “The SM schedules threads in groups of 32 parallel threads called warps. Each SM features two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently. Fermi’s dual warp scheduler selects two warps, and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs. Because warps execute independently, Fermi’s scheduler does not need to check for dependencies from within the instruction stream”.

      Furthermore, the NVIDIA Keppler Architecture Whitepaper states: “Kepler’s quad warp scheduler selects four warps, and two independent instructions per warp can be dispatched each cycle.”

      The “excess” cores are therefore used by scheduling more than one warp at a time.

    4. The warp scheduler schedules warps of the same kernel, not different kernels.

    5. Not quite true: Each block is locked in to a single SM, since that’s where its shared memory resides.
    6. That’s a tricky issue and depends on how your kernel is implemented. You may want to have a look at the nVidia Webinar Better Performance at Lower Occupancy by Vasily Volkov which explains some of the more important issues. Primarily, though, I would suggest you choose your thread count to improve occupancy, using the CUDA Occupancy Calculator.
    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

There has been a considerable amout of discussion about code metrics (e.g.: What is
I know there has been much shout about the goodness of MVC pattern. My
There has been some discussion on the SO community wiki about whether database objects
There has been similar questions asked (and answered), but never really together, and I
I know there has been similar questions before, but I can't find a solution
I'am aware there has been a generic question about a best IDE in C++
Has there been discussion around how to resolve equivalent openids? Meaning, I personally have
There has been a lot of talk about contra-revolutionary NoSQL databases like Cassandra ,
I've recently come on board with a PHP application. There has not been much
There has been some discussion in abandoning our CI system (Hudson FWIW) due to

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.