I am developing a CUDA application for GTX 580 with CUDA Toolkit 4.0 and

Question

Asked: June 12, 2026 (2026-06-12T06:31:50+00:00)

I am developing a CUDA application for a GTX 580 with CUDA Toolkit 4.0 and Visual Studio 2010 Professional on Windows 7 64-bit SP1. My program is more memory-intensive than typical CUDA programs, and I am trying to allocate as much shared memory as possible to each CUDA block. However, the program crashes every time I try to use more than 32 KB of shared memory per block.

From reading the official CUDA documentation, I learned that there is 48KB of on-die memory for each SM on a CUDA device with Compute Capability 2.0 or greater, and that the on-die memory is split between L1 cache and shared memory:

The same on-chip memory is used for both L1 and shared memory, and how much of
it is dedicated to L1 versus shared memory is configurable for each kernel call
(Section F.4.1)
http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/Fermi_Tuning_Guide.pdf

This led me to suspect that only 32KB of on-die memory was allocated as shared memory when my program was running. Hence my question: is it possible to use all 48KB of on-die memory as shared memory?

I tried everything I could think of. I specified the option --ptxas-options="-v -dlcm=cg" for nvcc, and I called cudaDeviceSetCacheConfig() and cudaFuncSetCacheConfig() in my program, but none of these resolved the issue. I even made sure that there was no register spilling and that I did not accidentally use local memory:

1>      24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>  ptxas info    : Used 63 registers, 40000+0 bytes smem, 52 bytes cmem[0], 2540 bytes cmem[2], 8 bytes cmem[14], 72 bytes cmem[16]

Although I can live with 32KB of shared memory, which already gave me a huge performance boost, I would rather take full advantage of all of the fast on-die memory. Any help is much appreciated.

Update: I was launching 640 threads per block when the program crashed. 512 threads gave me better performance than 256 did, so I tried to increase the number of threads further.


1 Answer

Editorial Team · Answer 1 · 2026-06-12T06:31:51+00:00

Your problem is not related to the shared memory configuration but to the number of threads you are launching.

At 63 registers per thread, a launch of 640 threads per block needs 63 × 640 = 40320 registers. A block has to fit on a single SM, and each SM on your device has only 32K (32768) registers, so the launch runs out of resources.
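The arithmetic above can be sketched in a few lines of host-side C++ (it also compiles unchanged inside a .cu file). This is only a rough check: the helper names are made up for illustration, the 32768-register limit is the per-SM figure for compute capability 2.x, and the hardware's register allocation granularity is ignored, so real usage can be somewhat higher than the plain product.

```cpp
// Register file size per SM on compute capability 2.x devices (e.g. GTX 580).
constexpr int kRegistersPerSM = 32 * 1024;
// Threads are scheduled in warps of 32, so block sizes are usually warp multiples.
constexpr int kWarpSize = 32;

// Hypothetical helper: registers one block needs, ignoring allocation granularity.
constexpr int registersPerBlock(int regsPerThread, int threadsPerBlock) {
    return regsPerThread * threadsPerBlock;
}

// Hypothetical helper: true if one block fits in a single SM's register file.
constexpr bool fitsInRegisterFile(int regsPerThread, int threadsPerBlock) {
    return registersPerBlock(regsPerThread, threadsPerBlock) <= kRegistersPerSM;
}

// Hypothetical helper: largest warp-multiple block size that still fits.
constexpr int maxThreadsPerBlock(int regsPerThread) {
    return (kRegistersPerSM / regsPerThread) / kWarpSize * kWarpSize;
}
```

With the numbers from the question, registersPerBlock(63, 640) is 40320, fitsInRegisterFile(63, 640) is false, fitsInRegisterFile(63, 512) is true, and maxThreadsPerBlock(63) is 512, which matches the observation that 512 threads ran and 640 did not.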

The on-chip memory configuration is explained well in Tom's answer. As he commented, checking the return values of your API calls for errors will help you catch problems like this in the future.
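A minimal sketch of that kind of error checking is shown below. The CUDA_CHECK macro, the kernel name scaleKernel, and the buffer sizes are all hypothetical and are not taken from the original program; the runtime calls (cudaFuncSetCacheConfig, cudaGetLastError, cudaDeviceSynchronize, cudaGetErrorString) are standard CUDA runtime API. The toy kernel statically allocates 40000 bytes of shared memory, as in the ptxas output from the question, but it uses far fewer than 63 registers, so on its own it would probably launch fine with 640 threads. A launch that asks for more registers than an SM has should make the cudaGetLastError() check report a failure (typically "too many resources requested for launch") instead of crashing later.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a readable message if a runtime call fails.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__,         \
                    __LINE__, cudaGetErrorString(err_));                   \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

// Toy kernel (illustrative only): 10000 floats = 40000 bytes of static shared memory.
__global__ void scaleKernel(float *data)
{
    __shared__ float tile[10000];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[idx];
    __syncthreads();
    data[idx] = 2.0f * tile[threadIdx.x];
}

int main()
{
    const int threads = 640;  // block size from the question
    const int blocks = 16;    // arbitrary grid size for the sketch
    const size_t bytes = size_t(threads) * blocks * sizeof(float);

    float *d_data = NULL;
    CUDA_CHECK(cudaMalloc((void **)&d_data, bytes));
    CUDA_CHECK(cudaMemset(d_data, 0, bytes));

    // Ask for the larger shared memory split for this kernel.
    CUDA_CHECK(cudaFuncSetCacheConfig(scaleKernel, cudaFuncCachePreferShared));

    scaleKernel<<<blocks, threads>>>(d_data);
    CUDA_CHECK(cudaGetLastError());       // reports invalid launch configurations
    CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel ran

    CUDA_CHECK(cudaFree(d_data));
    printf("kernel finished without errors\n");
    return 0;
}
```

Checking cudaGetLastError() right after the launch and cudaDeviceSynchronize() afterwards separates launch-time failures, such as running out of registers, from failures that occur while the kernel executes.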
