The Archive Base


Editorial Team
Asked: June 11, 2026

The kernel uses: (--ptxas-options=-v) 0 bytes stack frame, 0 bytes spill stores,


The kernel uses: (--ptxas-options=-v)
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info: Used 45 registers, 49152+0 bytes smem, 64 bytes cmem[0], 12 bytes cmem[16]

Launching with kernelA<<<20,512>>>(float parmA, int paramB); runs fine.
Launching with kernelA<<<20,513>>>(float parmA, int paramB); fails with an out-of-resources error ("too many resources requested for launch").

The Fermi device properties: 48 KB of shared memory per SM, 64 KB of constant memory, 32K registers per SM, a maximum of 1024 threads per block, compute capability 2.1 (sm_21).

I’m using all of my shared memory space.
At 45 registers per thread, I should run out of per-block register space at around 700 threads per block. Yet the kernel will not launch if I ask for more than half of the maximum number of threads per block. It may just be a coincidence, but I doubt it.
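
The estimate of roughly 700 threads per block can be checked with a little arithmetic. The sketch below is plain host-side C++ with hypothetical names (`maxThreadsByRegisters`, `kRegsPerBlock`, `kWarpSize`). It assumes the only constraint is the 32K register budget divided by the registers per thread, rounded down to whole warps, and it ignores any allocation rounding the hardware may apply.

```cpp
#include <cassert>

// Figures quoted in the question (Fermi, sm_21).
constexpr int kRegsPerBlock = 32 * 1024;  // 32K registers available to a block
constexpr int kWarpSize = 32;             // threads per warp

// Hypothetical helper: largest block size, in whole warps, whose registers
// fit in the budget when each thread uses `regsPerThread` registers.
// Assumes no allocation rounding; real hardware may round up.
int maxThreadsByRegisters(int regsPerThread) {
    assert(regsPerThread > 0);
    int threads = kRegsPerBlock / regsPerThread;  // e.g. 32768 / 45 = 728
    return (threads / kWarpSize) * kWarpSize;     // round down to whole warps
}
```

With 45 registers per thread this gives 704 threads, which is consistent with the "around 700" estimate, so the reported register count alone does not explain a failure at 513 threads.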

  1. Why can’t I use a full block of threads (1024)?
  2. Any guess as to which resource I’m running out of?
  3. I have often wondered where the stalled thread data/state goes between warps. What resource holds these?
1 Answer

    Editorial Team
    Added an answer on June 11, 2026 at 9:32 am

    When I measured the register count, I had commented out the printf calls: register count = 45.
    When the kernel was actually running, the printf calls were compiled in: register count = 63, with plenty of spilled registers.
    I suspect each thread really has 64 registers, with only 63 available to the program.
    64 registers * 512 threads = 32K, the maximum available to a single block.

    So I suggest that the number of registers available to a block's code = cudaDeviceProp::regsPerBlock - blockDim, i.e. the kernel doesn’t have access to all 32K registers.
    The compiler currently limits the number of registers per thread to 63 (anything beyond that spills to local memory). I suspect this limit of 63 is a hardware addressing limitation.

    So it looks like I’m running out of register space.
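
The 512 versus 513 boundary can be reproduced with a small host-side calculation. This is a sketch, not a description of the actual hardware allocator: it assumes registers are handed out per whole warp, so 513 threads occupy 17 warps, and it takes the 32K per-block budget from the figures above. The names `blockRegisterDemand`, `fitsInRegisterFile`, `kRegsPerBlock` and `kWarpSize` are hypothetical.

```cpp
#include <cassert>

constexpr int kRegsPerBlock = 32 * 1024;  // 32K registers per block, as quoted above
constexpr int kWarpSize = 32;             // threads per warp

// Registers a block would need if they are allocated per whole warp.
// This per-warp rounding is an assumption made for the sketch.
int blockRegisterDemand(int threadsPerBlock, int regsPerThread) {
    assert(threadsPerBlock > 0 && regsPerThread > 0);
    int warps = (threadsPerBlock + kWarpSize - 1) / kWarpSize;  // 513 threads -> 17 warps
    return warps * kWarpSize * regsPerThread;
}

// True if the block's register demand fits in the per-block budget.
bool fitsInRegisterFile(int threadsPerBlock, int regsPerThread) {
    return blockRegisterDemand(threadsPerBlock, regsPerThread) <= kRegsPerBlock;
}
```

Under this assumption, 512 threads at 63 registers need 32,256 registers and fit, while 513 threads round up to 17 warps and need 34,272, which exceeds 32K. At 45 registers the same 513-thread launch would fit, which matches the observation that the build with printf, not the 45-register build, is the one that fails.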
