I have a kernel that does a linear least-squares fit. It turns out the threads are using too many registers, so the occupancy is low. Here is the kernel:
__global__
void strainAxialKernel(
    float* d_dis,
    float* d_str
){
    int i = threadIdx.x;
    float a = 0;
    float c = 0;
    float e = 0;
    float f = 0;
    int shift = (int)((float)(i*NEIGHBOURS)/(float)WINDOW_PER_LINE);
    int j;
    __shared__ float dis[WINDOW_PER_LINE];
    __shared__ float str[WINDOW_PER_LINE];
    // fetch data from global memory
    dis[i] = d_dis[blockIdx.x*WINDOW_PER_LINE+i];
    __syncthreads();
    // least square fit
    for (j=-shift; j<NEIGHBOURS-shift; j++)
    {
        a += j;
        c += j*j;
        e += dis[i+j];
        f += (float(j))*dis[i+j];
    }
    str[i] = AMP*(a*e-NEIGHBOURS*f)/(a*a-NEIGHBOURS*c)/(float)BLOCK_SPACING;
    // compensate attenuation
    if (COMPEN_EXP>0 && COMPEN_BASE>0)
    {
        str[i]
            = (float)(str[i]*pow((float)i/(float)COMPEN_BASE+1.0f,COMPEN_EXP));
    }
    // write back to global memory
    if (!SIGN_PRESERVE && str[i]<0)
    {
        d_str[blockIdx.x*WINDOW_PER_LINE+i] = -str[i];
    }
    else
    {
        d_str[blockIdx.x*WINDOW_PER_LINE+i] = str[i];
    }
}
I have 32×404 blocks with 96 threads in each block. On a GTS 250, each SM should be able to handle 8 blocks. Yet the Visual Profiler shows 11 registers per thread, and as a result the occupancy is 0.625 (5 blocks per SM). BTW, the shared memory used by each block is 792 B, so the registers are the problem.
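For reference, the 0.625 figure follows directly from the warp counts. This is only a minimal sketch of that arithmetic, assuming the compute capability 1.x limit of 24 resident warps (768 threads) per SM; the helper name occupancy and its default arguments are made up for illustration and are not part of any CUDA API.

```cpp
// Hypothetical helper (not a CUDA API): occupancy as the ratio of resident
// warps to the per-SM warp limit.
// Assumption: 24 warps per SM, the limit for compute capability 1.0/1.1 parts.
inline float occupancy(int blocksPerSM, int threadsPerBlock,
                       int warpSize = 32, int maxWarpsPerSM = 24)
{
    // Round up: a partially filled warp still occupies a whole warp slot.
    const int warpsPerBlock = (threadsPerBlock + warpSize - 1) / warpSize;
    return (float)(blocksPerSM * warpsPerBlock) / (float)maxWarpsPerSM;
}
```

With 96 threads per block (3 warps), 5 resident blocks give 15 of 24 warps, and 8 resident blocks would fill all 24.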
The performance is not the end of the world. I am just curious whether there is any way I can get around this. Thanks.
There is always a trade-off between the fast but limited registers/shared memory and the slow but large global memory. There’s no way to “get around” that trade-off. If you reduce register usage by moving data to global memory, you should get higher occupancy but slower memory access.
That said, here are some ideas to use fewer registers:
a is just the sum of a simple arithmetic sequence, so reduce it to a closed form instead of accumulating it in the loop… (something like this)
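The snippet that originally followed is not preserved here, so what follows is only a sketch of one way to do it. It assumes the loop bounds from the kernel above (j runs from -shift to NEIGHBOURS-shift-1); the helper name sumOfOffsets and the HOST_DEVICE macro are made up for illustration. Whether this actually frees a register depends on what the compiler already does with the loop.

```cpp
// Expands to the CUDA qualifiers under nvcc and to nothing in a plain C++ build.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical helper (not from the original post).
// Closed form of a = sum of j for j = -shift .. n - shift - 1:
//   sum_{k=0}^{n-1} (k - shift) = n*(n-1)/2 - n*shift = n*(n - 1 - 2*shift)/2
// n*(n - 1 - 2*shift) is always even, so the integer division is exact.
HOST_DEVICE inline float sumOfOffsets(int n, int shift)
{
    return (float)(n * (n - 1 - 2 * shift) / 2);
}
```

In the kernel this would replace the a += j accumulation with a single call such as sumOfOffsets(NEIGHBOURS, shift) before or after the loop.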
or
so instead, do something like the following (you can probably reduce these expressions further):
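The code that belonged here is missing, so this is a guess at the spirit of the suggestion rather than the original snippet. It relies on one extra observation: for n consecutive integer offsets, a*a - n*c equals -n^2(n^2-1)/12 regardless of shift, so c never has to be accumulated. The helper name fitSlope, its parameters, and the HOST_DEVICE macro are hypothetical; the AMP and BLOCK_SPACING scaling from the kernel is left to the caller.

```cpp
// Expands to the CUDA qualifiers under nvcc and to nothing in a plain C++ build.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical helper (not from the original post): least-squares slope over
// the n samples dis[i - shift] .. dis[i - shift + n - 1], with the x
// coordinate running over j = -shift .. n - shift - 1 as in the kernel.
HOST_DEVICE inline float fitSlope(const float* dis, int i, int shift, int n)
{
    // a = sum of j, in closed form (the integer division is exact).
    const float a = (float)(n * (n - 1 - 2 * shift) / 2);
    // a*a - n*c is independent of shift: it equals -n^2 * (n^2 - 1) / 12.
    const float denom = -(float)(n * n) * (float)(n * n - 1) / 12.0f;

    float e = 0.0f;  // sum of the samples
    float f = 0.0f;  // sum of j * sample
    for (int j = -shift; j < n - shift; ++j) {
        const float y = dis[i + j];  // read once, reuse for both sums
        e += y;
        f += (float)j * y;
    }
    // Same formula as the kernel: (a*e - n*f) / (a*a - n*c).
    return (a * e - (float)n * f) / denom;
}
```

Because NEIGHBOURS is a compile-time constant in the kernel, denom folds to a constant there, and only a, e, and f remain live in the loop. The result could also be kept in a local variable instead of the shared str array, since each thread only ever reads its own element.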