The Archive Base

Editorial Team
Asked: May 24, 2026

I have a question about the throughput of a kernel running on a GPU.

Assume a kernel with an occupancy of 0.5 and a block size of 256. The programming guide states that it is better to have many blocks so that they can hide memory latency, etc., but I don’t understand why this is correct. As soon as the kernel has 24 warps per streaming multiprocessor (SM), i.e., 3 blocks, it will reach peak throughput, so having more than 24 warps (or 3 blocks) won’t change the throughput.

Am I missing anything? Can anyone correct me?

1 Answer

  1. Editorial Team
    Added an answer on May 24, 2026 at 11:50 am

    While it is true that an SM running at low occupancy cannot hide latency well, it is important to understand this:

    Higher Occupancy != Higher Throughput!

    Occupancy measures how many warps are resident on the SM relative to the maximum it supports, i.e., how much work the SM has to choose from at any given instant. Having more resident warps gives the SM more opportunity to do useful work while other warps are waiting for results (results of memory accesses or of computations; both have non-zero latency).
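
    For the numbers in the question, the arithmetic can be written out as below. This is only a worked check, and it assumes a Fermi-class SM (the architecture mentioned later in this answer) with 32 threads per warp and a limit of 48 resident warps per SM; other architectures have different limits.

    ```latex
    \begin{align*}
    \text{warps per block} &= \frac{256\ \text{threads}}{32\ \text{threads/warp}} = 8 \\
    \text{resident warps per SM} &= 3\ \text{blocks} \times 8\ \text{warps/block} = 24 \\
    \text{occupancy} &= \frac{24\ \text{resident warps}}{48\ \text{warp slots}} = 0.5
    \end{align*}
    ```

    An occupancy of 0.5 therefore says that half of the warp slots are filled. By itself it does not say whether the SM is issuing work at its peak rate.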

    Throughput is a measure of how much work gets done per second. While it can be limited by latency (and therefore by occupancy), it can also be limited by memory bandwidth, instruction throughput (the number of execution units), and other factors.

    The programming guide recommends multiple thread blocks rather than one large thread block because it sometimes helps to be able to issue work not just from other warps but also from other blocks. Here’s an example:

    Imagine that your big thread block has to load data from global memory (high latency), store it into shared memory (low latency), and then immediately call __syncthreads(). In this case, when a warp has finished loading its data and writing it to shared memory, it must wait until all other threads in the block have done the same. For a large block, that can take quite a while. But if multiple smaller thread blocks occupy the SM, the SM can switch to work from the other blocks while waiting for the __syncthreads() in the first block to be satisfied. This can help reduce GPU idle time and improve efficiency.
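
    The pattern described above can be sketched roughly as follows. This is a minimal illustration, not code from the original discussion: the kernel name `stage_and_reverse`, the host helper `run_demo`, and the tile-reversal workload are all hypothetical, chosen only so that each thread reads a value stored by a different thread of its block and therefore needs the barrier.

    ```cuda
    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    // Hypothetical kernel: stage one tile in shared memory, synchronize, then
    // read a value that a different thread of the same block stored.
    __global__ void stage_and_reverse(const float* in, float* out, int n)
    {
        extern __shared__ float tile[];   // blockDim.x floats, sized at launch

        int gid = blockIdx.x * blockDim.x + threadIdx.x;

        // High-latency global load followed by a low-latency shared store.
        tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;

        // Block-wide barrier: a warp that arrives early waits here for the
        // rest of its own block. Warps of other resident blocks are not
        // held up by this barrier and remain eligible to run.
        __syncthreads();

        // Reverse the tile: read the slot written by the mirrored thread.
        if (gid < n) out[gid] = tile[blockDim.x - 1 - threadIdx.x];
    }

    // Launches the kernel with the given block size and returns the number of
    // elements that differ from a CPU reference (-1 on a CUDA error).
    int run_demo(int n, int block)
    {
        std::vector<float> h_in(n), h_out(n, -1.0f);
        for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

        float* d_in = nullptr;
        float* d_out = nullptr;
        size_t bytes = n * sizeof(float);
        if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1;
        if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1; }
        cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

        int grid = (n + block - 1) / block;
        stage_and_reverse<<<grid, block, block * sizeof(float)>>>(d_in, d_out, n);

        // A blocking cudaMemcpy on the default stream waits for the kernel.
        cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_in);
        cudaFree(d_out);
        if (err != cudaSuccess) return -1;

        // CPU reference: element i comes from the mirrored slot of its tile.
        int mismatches = 0;
        for (int i = 0; i < n; ++i) {
            int base = (i / block) * block;
            int src = base + (block - 1 - (i - base));
            float expected = (src < n) ? h_in[src] : 0.0f;
            if (h_out[i] != expected) ++mismatches;
        }
        return mismatches;
    }

    int main()
    {
        const int n = 1 << 20;
        // Same work; the first launch splits it into four times as many blocks.
        printf("block=256  mismatches=%d\n", run_demo(n, 256));
        printf("block=1024 mismatches=%d\n", run_demo(n, 1024));
        return 0;
    }
    ```

    Both launches compute the same result. The difference is only in how many warps each barrier holds back: with 256-thread blocks a barrier stalls at most 8 warps at a time, while with a 1024-thread block it stalls up to 32. Whether that shows up as a measurable speedup depends on the kernel and the hardware, so it is worth timing both configurations.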

    You don’t necessarily want really tiny blocks either: the SMs on Fermi support at most 8 resident blocks, which caps how many warps tiny blocks can keep resident. But blocks of 128-512 threads are often more efficient than blocks of 1024 threads.
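
    One way to see how block size interacts with the resident-block limit on a particular device is to ask the runtime. The sketch below assumes a CUDA toolkit recent enough to provide `cudaOccupancyMaxActiveBlocksPerMultiprocessor` (CUDA 6.5 or later, as far as I know); `scale_kernel` and `resident_blocks` are hypothetical names used only for illustration, and the numbers printed depend entirely on the device and on the kernel’s resource usage.

    ```cuda
    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical stand-in kernel. The occupancy query inspects its register
    // and shared-memory usage, so substitute the kernel you care about.
    __global__ void scale_kernel(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    // Returns how many blocks of `block_size` threads can be resident on one
    // SM for scale_kernel, or -1 if the query fails.
    int resident_blocks(int block_size)
    {
        int blocks = 0;
        cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks, scale_kernel, block_size, 0 /* dynamic shared memory */);
        return (err == cudaSuccess) ? blocks : -1;
    }

    int main()
    {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
            printf("No usable CUDA device found\n");
            return 1;
        }
        int max_warps = prop.maxThreadsPerMultiProcessor / prop.warpSize;

        const int sizes[] = {32, 64, 128, 256, 512, 1024};
        const int count = sizeof(sizes) / sizeof(sizes[0]);
        for (int k = 0; k < count; ++k) {
            int blocks = resident_blocks(sizes[k]);
            if (blocks < 0) continue;   // skip block sizes the query rejects
            int warps = blocks * sizes[k] / prop.warpSize;
            printf("block=%4d  blocks/SM=%2d  warps/SM=%2d  occupancy=%.2f\n",
                   sizes[k], blocks, warps, (double)warps / max_warps);
        }
        return 0;
    }
    ```

    Very small block sizes typically hit the resident-block limit before they fill the warp slots, which is the effect described above. The reported occupancy is still only the amount of work available to the scheduler, not a throughput measurement.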
