I have a question about the throughput of a kernel running on a GPU.


Asked: May 24, 2026 (2026-05-24T11:50:20+00:00)


I have a question about the throughput of a kernel running on a GPU. Assume its occupancy is 0.5 and its block size is 256. The programming guide states that it is better to have many blocks so that they can hide memory latency, but I don’t understand why this is correct. As soon as the kernel has 24 warps per streaming multiprocessor (i.e., 3 blocks of 8 warps each), it will reach peak throughput, so having more than 24 warps (or 3 blocks) won’t change the throughput at all.

Am I missing anything? Can anyone correct me?


1 Answer

Editorial Team · Answer 1 · 2026-05-24T11:50:21+00:00

While it is true that SMs with low occupancy cannot sufficiently hide latency, it is important to understand this:

Higher Occupancy != Higher Throughput!

Occupancy is simply a measure of how much work is available for the SM to choose from at any given instant. Having more resident warps gives the SM more ability to do useful work while other warps are waiting for results (results of memory accesses, or computations — both have non-zero latency).
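To make the numbers in the question concrete, here is a minimal sketch that asks the CUDA runtime how many blocks of 256 threads can be resident on one SM and converts that into an occupancy figure. The kernel `scaleKernel` is a hypothetical placeholder (only its resource usage matters), and the program assumes device 0 and a toolkit that provides `cudaOccupancyMaxActiveBlocksPerMultiprocessor`. The values it prints depend on the GPU, so they will not necessarily match the 24-warp case above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; only its register/shared-memory
// footprint matters for the occupancy query below.
__global__ void scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int blockSize = 256;  // block size from the question

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "could not query CUDA device 0\n");
        return 1;
    }

    // How many blocks of this kernel fit on one SM at this block size?
    int blocksPerSM = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, scaleKernel, blockSize, 0 /* dynamic shared memory */);
    if (err != cudaSuccess) {
        fprintf(stderr, "occupancy query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Occupancy = resident warps / maximum resident warps per SM.
    int warpsPerBlock = blockSize / prop.warpSize;
    int residentWarps = blocksPerSM * warpsPerBlock;
    int maxWarps      = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    printf("blocks per SM:  %d\n", blocksPerSM);
    printf("resident warps: %d of %d\n", residentWarps, maxWarps);
    printf("occupancy:      %.2f\n", (float)residentWarps / maxWarps);
    return 0;
}
```

This only reports how many warps the SM has to choose from. It says nothing about whether those warps are enough to reach peak throughput for a particular kernel.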

Throughput is a measure of how much work gets done per second, and while it can be limited by latency (and therefore occupancy), it also can be limited by memory bandwidth, instruction throughput (the number of execution units), and other factors.

The programming guide recommends multiple thread blocks over one large thread block because it sometimes helps to be able to issue work not just from other warps but also from other blocks. Here’s an example:

Imagine that your big thread block has to load data from global memory (high latency), store it into shared memory (low latency), and then immediately call __syncthreads(). In this case, when a warp has finished loading its data and writing it to shared memory, it must wait until all other threads in the block have done the same. For a large block, that can take quite a while. But if multiple smaller thread blocks occupy the SM, the SM can switch to work from the other blocks while waiting for the __syncthreads() in the first block to be satisfied. This can help reduce GPU idle time and improve efficiency.
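The load, store, and barrier pattern described above might look like the following sketch. The kernel `reverseTiles` and its tile-reversal task are illustrative assumptions, not something from the question; they were chosen only because each thread must read a shared-memory element written by a different thread, which makes the barrier necessary. The block size is a launch parameter, so the same kernel can be run with several small blocks or one large block per SM.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical example kernel: reverses the elements inside each
// block-sized tile of `in`. Assumes the array length is a multiple
// of blockDim.x.
__global__ void reverseTiles(const float *in, float *out)
{
    extern __shared__ float tile[];   // blockDim.x floats, sized at launch
    int t    = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    tile[t] = in[base + t];           // global load (high latency) -> shared store
    __syncthreads();                  // each warp waits here for the whole block
    out[base + t] = tile[blockDim.x - 1 - t];
}

int main()
{
    const int n         = 1 << 20;
    const int blockSize = 256;        // try 128..1024; n must stay a multiple
    const size_t bytes  = n * sizeof(float);

    float *hIn  = (float *)malloc(bytes);
    float *hOut = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) hIn[i] = (float)i;

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice);

    reverseTiles<<<n / blockSize, blockSize, blockSize * sizeof(float)>>>(dIn, dOut);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost);

    // Check every element against a host-side reversal of its tile.
    int errors = 0;
    for (int i = 0; i < n; ++i) {
        int tileStart = (i / blockSize) * blockSize;
        int t         = i % blockSize;
        if (hOut[i] != hIn[tileStart + blockSize - 1 - t]) ++errors;
    }
    printf("%s (%d mismatches)\n", errors == 0 ? "PASS" : "FAIL", errors);

    cudaFree(dIn);
    cudaFree(dOut);
    free(hIn);
    free(hOut);
    return errors == 0 ? 0 : 1;
}
```

With a block of 1024 threads, a warp that reaches the barrier early may wait on up to 31 other warps, and on an SM that fits only one such block there is no other block to switch to. With blocks of 256 threads, each barrier involves 8 warps and warps from other resident blocks can be issued in the meantime. Whether this changes the measured runtime depends on the device and the kernel, so it is worth timing both configurations with a profiler.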

You don’t necessarily want to have really tiny blocks (since the SMs on Fermi support at most 8 resident blocks), but having blocks of 128-512 threads is often more efficient than using blocks with 1024 threads.
