This question is also started from following link: shared memory optimization confusion In above

Question

Asked: June 1, 2026 (2026-06-01T16:28:27+00:00)

This question follows on from an earlier one: shared memory optimization confusion

In talonmies’s answer to that question, I found that the first condition limiting the number of blocks that will be scheduled to run is “8”. I have three questions:

1. Does this mean that only 8 blocks can be scheduled at the same time, even when conditions 2 and 3 would allow more than 8? Does that hold regardless of the CUDA environment, the GPU device, or the algorithm?
2. If so, then in some cases it is better not to use shared memory, so we need a way to judge which option is better. One approach, I think, is to check whether global memory access is a bottleneck (that is, whether the kernel is limited by memory bandwidth). We could then choose not to use shared memory when global memory access is not the limiting factor. Is this a good approach?
3. Further to question 2: if the data my CUDA program has to handle is huge, I think not using shared memory may be better, because it is hard to fit that data within shared memory. Is this also a good approach?

1 Answer

Editorial Team · Answer 1 · 2026-06-01T16:28:29+00:00

The number of concurrently scheduled blocks is always going to be limited by something.

Playing with the CUDA Occupancy Calculator should make it clear how this works. The usage of three types of resources affects the number of concurrently scheduled blocks: Threads Per Block, Registers Per Thread, and Shared Memory Per Block.

If you set up a kernel that uses 1 thread per block, 1 register per thread and 1 byte of shared memory per block on Compute Capability 2.0, you are limited by Max Blocks per Multiprocessor, which is 8. If you start increasing Shared Memory Per Block, Max Blocks per Multiprocessor will continue to be your limiting factor until you reach the threshold at which Shared Memory Per Block becomes the limiting factor. Since there are 49152 bytes of shared memory per SM, that happens at around 49152 / 8 = 6144 bytes per block (it’s a bit less because some shared memory is used by the system, and shared memory is allocated in chunks of 128 bytes).
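To make the threshold concrete, here is a minimal sketch of that arithmetic in Python. It uses only the three figures quoted above (49152 bytes per SM, at most 8 blocks, 128-byte allocation chunks) and, as a simplifying assumption, ignores the small amount of shared memory used by the system. The function name `blocks_per_sm` is made up for illustration and is not part of any CUDA tool.

```python
SHARED_MEM_PER_SM = 49152  # bytes of shared memory per SM on Compute Capability 2.0
MAX_BLOCKS_PER_SM = 8      # cap on resident blocks per SM
ALLOC_GRANULARITY = 128    # shared memory is allocated in chunks of this many bytes


def blocks_per_sm(shared_bytes_per_block):
    """Resident blocks per SM when only shared memory and the block cap matter.

    Assumption: shared memory reserved by the system is ignored.
    """
    # Round the request up to a whole number of allocation chunks (ceiling division).
    chunks = -(-shared_bytes_per_block // ALLOC_GRANULARITY)
    allocated = max(chunks, 1) * ALLOC_GRANULARITY
    # The smaller of the two limits wins.
    return min(MAX_BLOCKS_PER_SM, SHARED_MEM_PER_SM // allocated)


for size in (1, 4096, 6144, 6145, 16384, 49152):
    print(size, blocks_per_sm(size))
```

Under these assumptions the result stays at 8 for every request up to 6144 bytes and drops to 7 at 6145 bytes, because that request is rounded up to 6272 bytes and only 7 such allocations fit in 49152.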

In other words, given the limit of 8 Max Blocks per Multiprocessor, using shared memory is completely free (as it relates to the number of concurrently running blocks), as long as you stay below the threshold at which Shared Memory Per Block becomes the limiting factor.

The same goes for register usage.
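As a rough illustration of how the limits combine, the sketch below takes the minimum over all of them and reports which one is binding. The per-SM figures are the Compute Capability 2.0 limits (8 blocks, 1536 threads, 32768 registers, 49152 bytes of shared memory). The function `resident_blocks` is hypothetical, and it deliberately leaves out the register and shared memory allocation granularities that the real Occupancy Calculator applies, so treat its output as an approximation.

```python
# Per-SM limits for Compute Capability 2.0.
MAX_BLOCKS_PER_SM = 8
MAX_THREADS_PER_SM = 1536
REGISTERS_PER_SM = 32768
SHARED_MEM_PER_SM = 49152
WARP_SIZE = 32


def resident_blocks(threads_per_block, regs_per_thread, shared_bytes_per_block):
    """Return (blocks per SM, name of the limiting resource).

    Simplified model: expects at least 1 thread and 1 register per thread,
    and ignores register and shared memory allocation granularity.
    """
    # Threads are scheduled in whole warps, so round the block size up to a warp multiple.
    warps = -(-threads_per_block // WARP_SIZE)
    threads_allocated = warps * WARP_SIZE

    limits = {
        "max blocks": MAX_BLOCKS_PER_SM,
        "threads": MAX_THREADS_PER_SM // threads_allocated,
        "registers": REGISTERS_PER_SM // (regs_per_thread * threads_allocated),
    }
    # A kernel that uses no shared memory is not limited by it at all.
    if shared_bytes_per_block > 0:
        limits["shared memory"] = SHARED_MEM_PER_SM // shared_bytes_per_block

    limiter = min(limits, key=limits.get)  # on a tie, the first entry is reported
    return limits[limiter], limiter


print(resident_blocks(1, 1, 1))
print(resident_blocks(256, 32, 4096))
print(resident_blocks(256, 16, 16384))
```

With the tiny kernel from the example above, the block cap is the binding limit. For a 256-thread block at 32 registers per thread, registers become the limit (4 blocks) long before 4096 bytes of shared memory per block would matter.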
