I have the following representative code:
__global__ void func()
{
    register ushort4 result = make_ushort4(__float2half_rn(0.5), __float2half_rn(0.5), __float2half_rn(0.5), __float2half_rn(1.0));
}
When compiling, result is stored in local memory. Is it possible to force this to registers? Local memory is too slow for the intended application.
Furthermore, this result must be stored to an array of var4 elements. I would like to store these results coalesced, like ((ushort4*)(output))[x + y * width] = result;. Another solution without var4 is also an option.
A vector type should be compiled into registers if there are registers available to do so. Turning your snippet into something that will survive dead code removal and compiling it shows no spills (i.e. no local memory use). Further, disassembling the resulting cubin file shows that the ushort4 is stuffed into registers and then a 64-bit store is used to write the packed vector out to global memory. There is no local memory access to be seen.
So if you have convinced yourself that you have a vector value compiling into local memory, it is either because you have a kernel with a lot of register pressure, or you are asking the compiler to do so (the volatile keyword will do that), or you have misinterpreted what the compiler/assembler are telling you at compile time.
EDIT: Using the CUDA 4.0 release toolkit with Visual Studio Express 2008 and compiling on 32-bit Windows 7 for a compute 1.1 device gives the exact same result as the original build for a compute 2.0 target.
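For illustration, a kernel of the kind described above (one whose result is actually written out, so it survives dead code removal) might look something like the sketch below. This is not the listing from the answer. The names `fill_half4` and `run_fill`, the `width`/`height` parameters, and the 32x4 block shape are all hypothetical. It also assumes a recent toolkit where `__float2half_rn` returns `__half`, so the bits are extracted with `__half_as_ushort`; the code in the question passes the intrinsic's result straight to `make_ushort4`.

```cuda
#include <cassert>
#include <vector>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Hypothetical kernel: builds a packed vector of four half-precision values
// and writes it to global memory so the compiler cannot eliminate it.
__global__ void fill_half4(ushort4 *output, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Assumption: __float2half_rn returns __half here, so take its raw bits.
    ushort4 result = make_ushort4(__half_as_ushort(__float2half_rn(0.5f)),
                                  __half_as_ushort(__float2half_rn(0.5f)),
                                  __half_as_ushort(__float2half_rn(0.5f)),
                                  __half_as_ushort(__float2half_rn(1.0f)));

    // One 8-byte element per thread; consecutive x values hit consecutive
    // addresses, which is the coalesced pattern asked about in the question.
    output[x + y * width] = result;
}

// Hypothetical host helper: launches the kernel and copies the result back.
bool run_fill(int width, int height, std::vector<ushort4> &host)
{
    assert(width > 0 && height > 0);
    size_t count = static_cast<size_t>(width) * height;

    ushort4 *dev = nullptr;
    if (cudaMalloc(&dev, count * sizeof(ushort4)) != cudaSuccess) return false;

    dim3 block(32, 4);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    fill_half4<<<grid, block>>>(dev, width, height);
    bool ok = (cudaGetLastError() == cudaSuccess);

    host.resize(count);
    ok = ok && cudaMemcpy(host.data(), dev, count * sizeof(ushort4),
                          cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dev);
    return ok;
}
```

One plausible way to check for local memory use is to compile with `nvcc -Xptxas -v`, which prints per-kernel register and spill counts, and to inspect the generated machine code with `cuobjdump -sass`. The exact commands used in the answer are not shown, so treat these as an assumption rather than a reproduction.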