I declared shared memory and tried to trace it with Nsight 2.2 for Visual Studio 2010. I’m using CUDA 4.2 with a Quadro 5000.
In my kernel.cu:
extern __shared__ ushort2 sampleGatheringSM[];
The code launching the kernel:
sampleGathering_SM_size = dimBlock.x * dimBlock.y * 4 * sizeof(ushort2) * 2; // = 10240
sampleGatheringKernel<<<dimGrid, dimBlock, sampleGathering_SM_size>>>(dev_image, dev_gradient, width, height);
When I run an Analysis activity in Nsight and open "CUDA Launches", it reports:
- Allocated Registers per block: 10240
- Allocated Shared Memory per block: 0
- Block Limit Reason: Registers
Did I allocate shared memory correctly?
I don’t understand how I could allocate registers.
EDIT:
It also tells me:
- Register per threads: 32
- Dynamic Shared memory per block: 0
- Static shared memory per block: 0
The declaration of dynamic shared memory is correct. The Nsight 2.2 Analysis Trace Report has a bug that affects only CUDA Trace activities. Trace activities run with Nsight | Options | Analysis | CUDA Kernel Trace Mode = Serialized, as well as Analysis Profiler CUDA activities, display the correct value. This bug will be fixed in the next version of Nsight.
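As for the register figure: the "// = 10240" comment implies 320 threads per block, since sizeof(ushort2) is 4 bytes and 10240 / (4 * 4 * 2) = 320. At 32 registers per thread, 320 threads come to 10240 registers per block, so that number appears to be a genuine register count that happens to equal the shared memory size in bytes, not shared memory reported under the wrong label. If you want to confirm the dynamic allocation without relying on the trace report, something like the sketch below may help. The kernel sharedMemProbe, the 32x10 block shape, and the output buffer are made up for illustration and are not your actual code; only the extern declaration and the size formula come from the question.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Same declaration as in the question: sized at launch time.
extern __shared__ ushort2 sampleGatheringSM[];

// Hypothetical stand-in kernel (not the real sampleGatheringKernel):
// every thread fills its slots, then reads a slot written by its neighbour.
__global__ void sharedMemProbe(unsigned int *out, unsigned int slotsPerThread)
{
    unsigned int tid      = threadIdx.y * blockDim.x + threadIdx.x;
    unsigned int nThreads = blockDim.x * blockDim.y;

    for (unsigned int i = 0; i < slotsPerThread; ++i)
        sampleGatheringSM[i * nThreads + tid] =
            make_ushort2((unsigned short)tid, (unsigned short)i);
    __syncthreads();

    // Touch the last slot of another thread, i.e. the far end of the buffer.
    unsigned int neighbour = (tid + 1) % nThreads;
    out[tid] = sampleGatheringSM[(slotsPerThread - 1) * nThreads + neighbour].x;
}

int main()
{
    dim3 dimBlock(32, 10);                       // assumed shape: 320 threads
    dim3 dimGrid(1);
    const unsigned int slotsPerThread = 4 * 2;   // the "4 * ... * 2" of the question
    const unsigned int nThreads = dimBlock.x * dimBlock.y;
    size_t smBytes = nThreads * slotsPerThread * sizeof(ushort2);

    unsigned int *devOut = 0;
    cudaMalloc((void **)&devOut, nThreads * sizeof(unsigned int));

    sharedMemProbe<<<dimGrid, dimBlock, smBytes>>>(devOut, slotsPerThread);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Each thread should have read its neighbour's thread index.
    std::vector<unsigned int> hostOut(nThreads);
    cudaMemcpy(&hostOut[0], devOut, nThreads * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    bool ok = true;
    for (unsigned int t = 0; t < nThreads; ++t)
        if (hostOut[t] != (t + 1) % nThreads) ok = false;

    // Static per-kernel figures as seen by the runtime (dynamic size is not included).
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, sharedMemProbe);

    printf("dynamic shared memory requested: %u bytes\n", (unsigned int)smBytes);
    printf("static shared memory:            %u bytes\n", (unsigned int)attr.sharedSizeBytes);
    printf("registers per thread:            %d\n", attr.numRegs);
    printf("registers per block (approx.):   %u\n", attr.numRegs * nThreads);
    printf("shared memory round trip:        %s\n", ok ? "ok" : "MISMATCH");

    cudaFree(devOut);
    return ok ? 0 : 1;
}
```

If the round trip reports ok, the block really did get the bytes you asked for at launch, whatever the trace report shows. The registers-per-block line is only thread count times registers per thread; the hardware rounds register allocation up per warp, so the figure a profiler shows can be somewhat higher.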