I am using a Tesla C1060 (compute capability 1.3) with version 4.0 of the nvcc compiler driver. I am trying to do some computation that is local to each thread block. Each thread block has a shared array, which is first initialized to zero. To synchronize concurrent updates (additions) to the shared data by the threads of a block, I use the CUDA atomicAdd primitive.
Once a thread block has its results ready in the shared array, each entry of the shared array is merged (using atomicAdd) into the corresponding entry of the global data array.
Below is code very similar to what I am trying to do.
#define DATA_SZ 16
typedef unsigned long long int ULLInt;

__global__ void kernel( ULLInt* data, ULLInt ThreadCount )
{
    ULLInt thid = threadIdx.x + blockIdx.x * blockDim.x;
    __shared__ ULLInt sharedData[DATA_SZ];

    // Initialize the shared data
    if( threadIdx.x == 0 )
    {
        for( int i = 0; i < DATA_SZ; i++ ) { sharedData[i] = 0; }
    }
    __syncthreads();

    //..some code here
    if( thid < ThreadCount )
    {
        //..some code here
        // getIndex() returns a value in [0, DATA_SZ )
        atomicAdd( &sharedData[getIndex(thid)], thid );
        //..some code here
        for(..a loop...)
        {
            //..some code here
            if( thid % 2 == 0 )
            {
                atomicAdd( &sharedData[getIndex(thid)], thid * thid );
            }
        }
    }
    __syncthreads();

    // Merge this block's shared results into the global array
    if( threadIdx.x == 0 )
    {
        // ...
        for( int i = 0; i < DATA_SZ; i++ ) { atomicAdd( &data[i], sharedData[i] ); }
        //...
    }
}
If I compile with -arch=sm_20 I don’t get any errors. However when I compile the kernel using the -arch=sm_13 option I get the following errors:
ptxas /tmp/tmpxft_00004dcf_00000000-2_mycode.ptx, line error : Global state space expected for instruction 'atom'
ptxas /tmp/tmpxft_00004dcf_00000000-2_mycode.ptx, line error : Global state space expected for instruction 'atom'
ptxas fatal : Ptx assembly aborted due to errors
If I comment out the following two lines I don’t get any errors with -arch=sm_13:
atomicAdd( &sharedData[getIndex(thid)], thid );
atomicAdd( &sharedData[getIndex(thid)], thid * thid );
Can someone suggest what I might be doing wrong?
I found the solution in the CUDA C Programming Guide: Atomic functions operating on shared memory and atomic functions operating on 64-bit words are only available for devices of compute capability 1.2 and above. Atomic functions operating on 64-bit words in shared memory are only available for devices of compute capability 2.x and higher.
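So basically I cannot use ULLInt for the shared memory array here: on a compute capability 1.3 device the shared array has to be a 32-bit type such as unsigned int. The global array can stay ULLInt, because 64-bit atomics on global memory are available from compute capability 1.2. The catch is that each block's partial sums now have to fit in 32 bits.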
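A minimal sketch of that workaround, under some assumptions: getIndex() here is a hypothetical stand-in (thid % DATA_SZ) for the real one, the elided parts of the original kernel are left out, and the launch configuration and thread count in main() are made up for illustration. The shared array uses unsigned int, and each entry is widened to ULLInt only when it is merged into the global array.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define DATA_SZ 16
typedef unsigned long long int ULLInt;

// Hypothetical stand-in for the question's getIndex(); returns a value in [0, DATA_SZ).
__device__ unsigned int getIndex( ULLInt thid )
{
    return (unsigned int)( thid % DATA_SZ );
}

__global__ void kernel( ULLInt* data, ULLInt threadCount )
{
    ULLInt thid = threadIdx.x + blockIdx.x * blockDim.x;

    // 32-bit entries: atomics on 32-bit words in shared memory only need
    // compute capability 1.2, unlike 64-bit shared atomics (2.x and higher).
    __shared__ unsigned int sharedData[DATA_SZ];

    // Initialize the shared data
    if( threadIdx.x == 0 )
    {
        for( int i = 0; i < DATA_SZ; i++ ) { sharedData[i] = 0; }
    }
    __syncthreads();

    if( thid < threadCount )
    {
        // Assumption: the per-block sums fit in 32 bits.
        atomicAdd( &sharedData[getIndex( thid )], (unsigned int)thid );
    }
    __syncthreads();

    // Merge into the 64-bit global array; 64-bit atomics on global memory
    // are available from compute capability 1.2.
    if( threadIdx.x == 0 )
    {
        for( int i = 0; i < DATA_SZ; i++ )
        {
            atomicAdd( &data[i], (ULLInt)sharedData[i] );
        }
    }
}

int main()
{
    const ULLInt threadCount = 200;   // illustrative value
    ULLInt hostData[DATA_SZ];
    ULLInt* devData = 0;

    cudaMalloc( (void**)&devData, sizeof( hostData ) );
    cudaMemset( devData, 0, sizeof( hostData ) );

    kernel<<<4, 64>>>( devData, threadCount );   // 256 threads, 200 of them active

    cudaMemcpy( hostData, devData, sizeof( hostData ), cudaMemcpyDeviceToHost );
    cudaFree( devData );

    // Every active thread added its own id once, so the entries should sum
    // to 0 + 1 + ... + (threadCount - 1).
    ULLInt total = 0;
    for( int i = 0; i < DATA_SZ; i++ ) { total += hostData[i]; }
    ULLInt expected = threadCount * ( threadCount - 1 ) / 2;
    printf( "total = %llu, expected = %llu\n", total, expected );
    return total == expected ? 0 : 1;
}
```

If the values added per block can exceed the 32-bit range (for example the thid * thid terms with large thread ids), this sketch is not enough as written; one option would be to flush the shared entries to the global array more often, before they can overflow.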