I have a somewhat complex kernel with the following stats:
ptxas info : Compiling entry function 'my_kernel' for 'sm_21'
ptxas info : Function properties for my_kernel
32 bytes stack frame, 64 bytes spill stores, 40 bytes spill loads
ptxas info : Used 62 registers, 120 bytes cmem[0], 128 bytes cmem[2], 8 bytes cmem[14], 4 bytes cmem[16]
It’s not clear to me which part of the kernel is the “high water mark” in terms of register usage. The nature of the kernel is such that stubbing out various parts for constant values causes the optimizer to constant-fold later parts, etc. (at least that’s how it seems, since the numbers I get back when I do so don’t make much sense).
The CUDA profiler is similarly unhelpful AFAICT, simply telling me that I have register pressure.
Is there a way to get more information about register usage? I’d prefer a tool of some kind, but I’d also be interested in hearing about digging into the compiled binary directly, if that’s what it takes.
Edit: It is certainly possible for me to approach this bottom-up (i.e., making experimental code changes, checking the impact on register usage, etc.), but I’d rather start top-down, or at least get some guidance on where to begin a bottom-up investigation.
You can get a feel for the complexity of the compiler output by compiling to annotated PTX, which produces a PTX assembly file with the original C code embedded in it as comments.
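As a rough sketch of the annotated-PTX step: the file name `my_kernel.cu` is a placeholder, and the `-Xopencc="-LIST:source=on"` option assumes a CUDA 4.0 era toolkit, where the Open64 front end generates the PTX. Newer toolkits dropped that front end; there, `-src-in-ptx` combined with `-lineinfo` should give a similar interleaved listing.

```shell
# Hypothetical file name; substitute your own source file.
SRC=my_kernel.cu
PTX_OUT="${SRC%.cu}.ptx"   # my_kernel.ptx

# Guarded so the snippet is a no-op when nvcc or the source file is missing.
if command -v nvcc >/dev/null 2>&1 && [ -f "$SRC" ]; then
  # -ptx stops compilation after PTX generation.
  # -Xopencc passes -LIST:source=on to the Open64 front end (CUDA 4.0 era),
  # which interleaves the C source as comments in the emitted PTX.
  nvcc -ptx -Xopencc="-LIST:source=on" "$SRC" -o "$PTX_OUT"
fi
```

The resulting `.ptx` file is plain text, so it can be read or searched with any editor.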
Note that this doesn’t directly tell you anything about the register usage, because PTX is written in static single assignment form: every value gets its own virtual register, and the real register allocation only happens later, in the assembler. It does show you what the assembler is given as input and how that relates to your original code. With the CUDA 4.0 toolkit, you can then compile the C to a cubin file for the Fermi architecture and use the cuobjdump utility to disassemble the machine code the assembler produces.

It is usually possible to trace back from the disassembly to the PTX and get at least a rough idea of where the “greedy” code sections are. Having said all that, managing register pressure is one of the more difficult aspects of CUDA programming at the moment. If/when NVIDIA ever document their ELF format for device code, I reckon a proper code analyzing tool would be a great project for someone.
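The cubin and disassembly steps described above might look like the following sketch. The file name `my_kernel.cu`, the helper `high_regs`, and the threshold of 50 are all illustrative assumptions, not part of any NVIDIA tool. The helper is a crude heuristic: it lists the disassembled lines that mention a high-numbered register, which tends to point at regions where many values are live at once, but register numbering is not a precise measure of pressure.

```shell
# Hypothetical file name; substitute your own source file.
SRC=my_kernel.cu
CUBIN_OUT="${SRC%.cu}.cubin"   # my_kernel.cubin

# high_regs N: read SASS text on stdin and print, with line numbers,
# every line that references a register RN or higher.
high_regs() {
  awk -v t="$1" '{
    line = $0; hit = 0
    # Walk through every R<number> token on the line.
    while (match(line, /R[0-9]+/)) {
      n = substr(line, RSTART + 1, RLENGTH - 1) + 0
      if (n >= t) hit = 1
      line = substr(line, RSTART + RLENGTH)
    }
    if (hit) print NR ": " $0
  }'
}

# Guarded so the snippet is a no-op when the toolkit or source file is missing.
if command -v nvcc >/dev/null 2>&1 && command -v cuobjdump >/dev/null 2>&1 \
   && [ -f "$SRC" ]; then
  # -cubin emits device machine code only; -arch=sm_20 targets Fermi
  # (toolkits that still support it); -Xptxas=-v prints the register
  # and spill statistics shown in the question.
  nvcc -cubin -arch=sm_20 -Xptxas=-v "$SRC" -o "$CUBIN_OUT"
  # -sass disassembles the machine code; filter for high register indices.
  cuobjdump -sass "$CUBIN_OUT" | high_regs 50
fi
```

For a kernel reported at 62 registers, a threshold somewhere in the 50s is a reasonable starting point; lowering it widens the set of lines reported.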