The Archive Base

Editorial Team
Asked: June 12, 2026

I am developing a CUDA application for GTX 580 with CUDA Toolkit 4.0 and

I am developing a CUDA application for a GTX 580 with CUDA Toolkit 4.0 and Visual Studio 2010 Professional on Windows 7 64-bit SP1. My program is more memory-intensive than typical CUDA programs, and I am trying to allocate as much shared memory as possible to each CUDA block. However, the program crashes every time I try to use more than 32 KB of shared memory per block.

From reading the official CUDA documentation, I learned that there is 48 KB of on-die memory for each SM on a CUDA device with Compute Capability 2.0 or greater, and that the on-die memory is split between L1 cache and shared memory:

The same on-chip memory is used for both L1 and shared memory, and how much of
it is dedicated to L1 versus shared memory is configurable for each kernel call
(Section F.4.1)
http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/Fermi_Tuning_Guide.pdf

This led me to suspect that only 32 KB of on-die memory was allocated as shared memory when my program was running. Hence my question: is it possible to use all 48 KB of on-die memory as shared memory?

I tried everything I could think of. I specified the option --ptxas-options="-v -dlcm=cg" for nvcc, and I called cudaDeviceSetCacheConfig() and cudaFuncSetCacheConfig() in my program, but none of them resolved the issue. I even made sure that there was no register spilling and that I did not accidentally use local memory:

1>      24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>  ptxas info    : Used 63 registers, 40000+0 bytes smem, 52 bytes cmem[0], 2540 bytes cmem[2], 8 bytes cmem[14], 72 bytes cmem[16]

Although I can live with 32KB of shared memory, which already gave me a huge performance boost, I would rather take full advantage of all of the fast on-die memory. Any help is much appreciated.

Update: I was launching 640 threads per block when the program crashed. 512 threads gave me better performance than 256 did, so I tried to increase the number of threads further.
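As a quick sanity check on the shared memory figure in the ptxas output above, the comparison can be sketched in plain host-side C++. This is only an illustration: the two split sizes (48 KB shared with 16 KB L1, or 16 KB shared with 48 KB L1) are assumed from the Fermi documentation, and the names `CachePreference`, `sharedBytesAvailable` and `sharedMemoryFits` are hypothetical helpers, not part of the CUDA API.

```cpp
#include <cstddef>

// Hypothetical model of the two on-chip memory splits assumed for
// Compute Capability 2.x devices; this is not a CUDA API type.
enum class CachePreference { PreferShared, PreferL1 };

// Shared memory available per block under each assumed split, in bytes.
constexpr std::size_t sharedBytesAvailable(CachePreference pref) {
    return pref == CachePreference::PreferShared ? 48 * 1024 : 16 * 1024;
}

// True if a kernel's static shared memory request fits under the given split.
constexpr bool sharedMemoryFits(std::size_t requestedBytes, CachePreference pref) {
    return requestedBytes <= sharedBytesAvailable(pref);
}

// 40000 bytes is the "smem" figure reported by ptxas in the question.
static_assert(sharedMemoryFits(40000, CachePreference::PreferShared),
              "40000 bytes fits when 48 KB is configured as shared memory");
static_assert(!sharedMemoryFits(40000, CachePreference::PreferL1),
              "40000 bytes does not fit when only 16 KB is shared memory");
```

Under these assumptions the 40000-byte request is above 32 KB but still below the 48 KB ceiling, so the shared memory size by itself would not be expected to rule out the launch.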

1 Answer

  1. Editorial Team
    Added an answer on June 12, 2026 at 6:31 am

    Your problem is related not to the shared memory configuration but to the number of threads you are launching.

    Using 63 registers per thread and launching 640 threads requires 63 × 640 = 40,320 registers in total. Your device has only 32K (32,768) registers per multiprocessor, so the launch runs out of resources. With 512 threads the same kernel needs 32,256 registers, which still fits.
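The register arithmetic can be sketched in plain host-side C++. This is a simplified illustration under stated assumptions: the 32K-register limit is the figure quoted in this answer, the warp size of 32 is assumed, hardware allocation granularity is ignored, and the names `kRegistersPerSM`, `registersPerBlock`, `blockFitsRegisterFile` and `maxThreadsPerBlock` are hypothetical helpers rather than CUDA API calls.

```cpp
#include <cstdint>

// Register file size quoted in the answer for this device (an assumption
// here, not queried from the hardware): 32 K 32-bit registers.
constexpr std::uint32_t kRegistersPerSM = 32 * 1024;

// Registers one block needs if every thread uses regsPerThread registers.
// Simplification: ignores any per-warp allocation rounding done by hardware.
constexpr std::uint32_t registersPerBlock(std::uint32_t regsPerThread,
                                          std::uint32_t threadsPerBlock) {
    return regsPerThread * threadsPerBlock;
}

// True if a single block's register demand fits in the register file.
constexpr bool blockFitsRegisterFile(std::uint32_t regsPerThread,
                                     std::uint32_t threadsPerBlock) {
    return registersPerBlock(regsPerThread, threadsPerBlock) <= kRegistersPerSM;
}

// Largest block size (rounded down to a whole number of warps) that fits.
constexpr std::uint32_t maxThreadsPerBlock(std::uint32_t regsPerThread,
                                           std::uint32_t warpSize = 32) {
    return (kRegistersPerSM / regsPerThread) / warpSize * warpSize;
}

static_assert(registersPerBlock(63, 640) == 40320, "63 * 640 registers");
static_assert(!blockFitsRegisterFile(63, 640), "640 threads exceed 32768");
static_assert(blockFitsRegisterFile(63, 512), "512 threads need 32256");
static_assert(maxThreadsPerBlock(63) == 512, "largest warp-aligned block");
```

With 63 registers per thread this simplified model gives 512 as the largest warp-aligned block size, which matches the observation that 512 threads ran while 640 did not. Reducing register use per thread would be the way to fit larger blocks.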

    The on-chip memory configuration is well explained in Tom's answer and, as he commented, checking the API calls for errors will help you catch problems like this in the future.
