The kernel uses: (--ptxas-options=-v)
0 bytes stack frame, 0 bytes spill sotes, 0 bytes spill loads
ptxas info: Used 45 registers, 49152+0 bytes smem, 64 bytes cmem[0], 12 bytes cmem[16]
Launch with: kernelA<<<20,512>>>(float parmA, int paramB); and it will run fine.
Launch with: kernelA<<<20,513>>>(float parmA, int paramB); and it get the out of resources error. (too many resources requested for launch).
The Fermi device properties: 48KB of shared mem per SM, constant mem 64KB, 32K registers per SM, 1024 maximum threads per block, comp capable 2.1 (sm_21)
I’m using all my shared mem space.
I’ll run out of block register space around 700 threads/block. The kernel will not launch if I ask for more than half the number of MAX_threads/block. It may just be a coincidence, but I doubt it.
- Why can’t I use a full block of threads (1024)?
- Any guess as to which resource I’m running out of?
- I have often wondered where the stalled thread data/state goes between warps. What resource holds these?
When I did the reg count, I commented out the printf’s. Reg count= 45
When it was running, it had the printf’s coded. Reg count= 63 w/plenty of spill “reg’s”.
I suspect each thread really has 64 reg’s, with only 63 available to the program.
64 reg’s * 512 threads = 32K – The maximum available to a single block.
So I suggest the # of available “code” reg’s to a block = cudaDeviceProp::regsPerBlock – blockDim i.e. The kernel doesn’t have access to all 32K registers.
The compiler currently limits the # of reg’s per thread to 63, (or they spill over to lmem). I suspect this 63, is a HW addressing limitation.
So it looks like I’m running out of register space.