Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 6375873
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 25, 20262026-05-25T01:40:18+00:00 2026-05-25T01:40:18+00:00

I have a somewhat complex kernel with the following stats: ptxas info : Compiling

  • 0

I have a somewhat complex kernel with the following stats:

ptxas info    : Compiling entry function 'my_kernel' for 'sm_21'
ptxas info    : Function properties for my_kernel
    32 bytes stack frame, 64 bytes spill stores, 40 bytes spill loads
ptxas info    : Used 62 registers, 120 bytes cmem[0], 128 bytes cmem[2], 8 bytes cmem[14], 4 bytes cmem[16]

It’s not clear to me which part of the kernel is the “high water mark” in terms of register usage. The nature of the kernel is such that stubbing out various parts for constant values causes the optimizer to constant-fold later parts, etc. (at least that’s how it seems, since the numbers I get back when I do so don’t make much sense).

The CUDA profiler is similarly unhelpful AFAICT, simply telling me that I have register pressure.

Is there a way to get more information about register usage? I’d prefer a tool of some kind, but I’d also be interested in hearing about digging into the compiled binary directly, if that’s what it takes.

Edit: It is certainly possible for me to approach this bottom-up (ie. making experimental code changes, checking the impact on register usage, etc.) but I’d rather start top-down, or at least get some guidance on where to begin bottom-up investigation.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-25T01:40:19+00:00Added an answer on May 25, 2026 at 1:40 am

    You can get a feel for the complexity of the compiler output by compiling to annotated PTX like this:

    nvcc -ptx -Xopencc="-LIST:source=on" branching.cu
    

    which will issue a PTX assembler file with the original C code inside it as comments:

            .entry _Z11branchTest0PfS_S_ (
                    .param .u64 __cudaparm__Z11branchTest0PfS_S__a,
                    .param .u64 __cudaparm__Z11branchTest0PfS_S__b,
                    .param .u64 __cudaparm__Z11branchTest0PfS_S__d)
            {
            .reg .u16 %rh<4>;
            .reg .u32 %r<5>;
            .reg .u64 %rd<10>;
            .reg .f32 %f<5>;
            .loc    16      1       0
     //   1  __global__ void branchTest0(float *a, float *b, float *d)
    $LDWbegin__Z11branchTest0PfS_S_:
            .loc    16      7       0
     //   3         unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x;
     //   4         float aval = a[tidx], bval = b[tidx];
     //   5         float z0 = (aval > bval) ? aval : bval;
     //   6  
     //   7         d[tidx] = z0;
            mov.u16         %rh1, %ctaid.x;
            mov.u16         %rh2, %ntid.x;
            mul.wide.u16    %r1, %rh1, %rh2;
            cvt.u32.u16     %r2, %tid.x;
            add.u32         %r3, %r2, %r1;
            cvt.u64.u32     %rd1, %r3;
            mul.wide.u32    %rd2, %r3, 4;
            ld.param.u64    %rd3, [__cudaparm__Z11branchTest0PfS_S__a];
            add.u64         %rd4, %rd3, %rd2;
            ld.global.f32   %f1, [%rd4+0];
            ld.param.u64    %rd5, [__cudaparm__Z11branchTest0PfS_S__b];
            add.u64         %rd6, %rd5, %rd2;
            ld.global.f32   %f2, [%rd6+0];
            max.f32         %f3, %f1, %f2;
            ld.param.u64    %rd7, [__cudaparm__Z11branchTest0PfS_S__d];
            add.u64         %rd8, %rd7, %rd2;
            st.global.f32   [%rd8+0], %f3;
            .loc    16      8       0
     //   8  }
            exit;
    $LDWend__Z11branchTest0PfS_S_:
            } // _Z11branchTest0PfS_S_
    

    Note that this doesn’t directly tell you anything about the register usage, because PTX uses static-single assignment, but it shows you what the assembler is given as an input and how it relates to your original code. With the CUDA 4.0 toolkit, you can then compile the C to a cubin file for the Fermi architecture:

    $ nvcc -cubin -arch=sm_20 -Xptxas="-v" branching.cu
    ptxas info    : Compiling entry function '_Z11branchTest1PfS_S_' for 'sm_20'
    ptxas info    : Function properties for _Z11branchTest1PfS_S_
        0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    

    and use the cuobjdump utility to disassemble the machine code the assembler produces.

    $ cuobjdump -sass branching.cubin 
    
    code for sm_20
        Function : _Z11branchTest0PfS_S_
    /*0000*/     /*0x00005de428004404*/     MOV R1, c [0x1] [0x100];
    /*0008*/     /*0x94001c042c000000*/     S2R R0, SR_CTAid_X;
    /*0010*/     /*0x84009c042c000000*/     S2R R2, SR_Tid_X;
    /*0018*/     /*0x10015de218000000*/     MOV32I R5, 0x4;
    /*0020*/     /*0x2000dc0320044000*/     IMAD.U32.U32 R3, R0, c [0x0] [0x8], R2;
    /*0028*/     /*0x10311c435000c000*/     IMUL.U32.U32.HI R4, R3, 0x4;
    /*0030*/     /*0x80319c03200b8000*/     IMAD.U32.U32 R6.CC, R3, R5, c [0x0] [0x20];
    /*0038*/     /*0x9041dc4348004000*/     IADD.X R7, R4, c [0x0] [0x24];
    /*0040*/     /*0xa0321c03200b8000*/     IMAD.U32.U32 R8.CC, R3, R5, c [0x0] [0x28];
    /*0048*/     /*0x00609c8584000000*/     LD.E R2, [R6];
    /*0050*/     /*0xb0425c4348004000*/     IADD.X R9, R4, c [0x0] [0x2c];
    /*0058*/     /*0xc0329c03200b8000*/     IMAD.U32.U32 R10.CC, R3, R5, c [0x0] [0x30];
    /*0060*/     /*0x00801c8584000000*/     LD.E R0, [R8];
    /*0068*/     /*0xd042dc4348004000*/     IADD.X R11, R4, c [0x0] [0x34];
    /*0070*/     /*0x00201c00081e0000*/     FMNMX R0, R2, R0, !pt;
    /*0078*/     /*0x00a01c8594000000*/     ST.E [R10], R0;
    /*0080*/     /*0x00001de780000000*/     EXIT;
        ......................................
    

    It is usually possible to trace back from assembler to PTX and get at least a rough idea where the “greedy” code sections are. Having said all that, managing register pressure is one of the more difficult aspects of CUDA programming at the moment. If/when NVIDIA ever document their ELF format for device code, I reckon a proper code analyzing tool would be a great project for someone.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I have a somewhat complex plot task in matplotlib that requires--I think--an autoscale() function
I have a somewhat complex data model in my iPad application (an OpenGL drawing
I have a somewhat complex custom user control, created in C# .NET, which is
I have an ObservableCollection of ChildViewModels with somewhat complex behaviour. When I go to
I have a PHP web app that generates a somewhat complex report from our
I have a somewhat complex query with multiple (nested) sub-queries, which I want to
I have a somewhat complex pricing mechanism in my application- Here are some of
I'm creating a game in which I have a somewhat complex method for creating
Particularly does memoization decrease performance somewhat? Is performance increase linear? I have a function
I have a somewhat complex regular expression which I'm trying to match against a

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.