How can I write a statement in my CUDA kernel that is executed by a single thread? For example, if I have the following kernel:
__global__ void Kernel(bool *d_over, bool *d_update_flag_threads, int no_nodes)
{
    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    if( tid<no_nodes && d_update_flag_threads[tid])
    {
        ...
        *d_over=true; // writing a single memory location, only 1 thread should do?
        ...
    }
}
In the above kernel, “d_over” is a single boolean flag, while “d_update_flag_threads” is a boolean array.
What I normally do is use the first thread in the thread block, e.g.:
if(threadIdx.x==0)
but that cannot work in this case, because I have a flag array here and only threads whose associated flag is “true” execute the if statement, so thread 0 of a block may never reach it. That flag array is set by another CUDA kernel called earlier, and I don’t know its contents in advance.
In short, I need something similar to the “single” construct in OpenMP.
A possible approach is to use atomic operations. If you need only one thread per block to do the update, you could perform the atomic operation in shared memory (for compute capability >= 1.2), which is generally much faster than performing it in global memory.
That said, the idea is as follows:
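What follows is only a minimal sketch of that idea, not a definitive implementation. It assumes a per-block shared counter (here called s_claimed, a made-up name) that the flagged threads race on with atomicExch, so that only the first one to arrive performs the write. The extra d_writers argument and the host helper count_single_writers are hypothetical additions for illustration, there only so the behaviour can be observed from the host.

```cuda
#include <cuda_runtime.h>

// Same shape as the kernel in the question, plus a hypothetical d_writers
// counter used only to observe how many threads performed the write.
__global__ void Kernel(bool *d_over, const bool *d_update_flag_threads,
                       int no_nodes, int *d_writers)
{
    __shared__ int s_claimed;              // per-block "already done" ticket
    if (threadIdx.x == 0) s_claimed = 0;   // initialised outside the flag test
    __syncthreads();                       // reached by every thread, flagged or not

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < no_nodes && d_update_flag_threads[tid])
    {
        // atomicExch returns the previous value, so among the flagged
        // threads of a block only the first one to arrive sees 0.
        if (atomicExch(&s_claimed, 1) == 0)
        {
            *d_over = true;                // executed once per block at most
            atomicAdd(d_writers, 1);       // instrumentation only
        }
    }
}

// Hypothetical host helper: runs the kernel on the given flags and returns
// how many threads wrote d_over (or -1 if a CUDA call failed).
int count_single_writers(const bool *h_flags, int no_nodes,
                         int threads_per_block, bool *h_over)
{
    bool *d_over = nullptr, *d_flags = nullptr;
    int *d_writers = nullptr;
    cudaMalloc(&d_over, sizeof(bool));
    cudaMalloc(&d_flags, no_nodes * sizeof(bool));
    cudaMalloc(&d_writers, sizeof(int));
    cudaMemset(d_over, 0, sizeof(bool));
    cudaMemset(d_writers, 0, sizeof(int));
    cudaMemcpy(d_flags, h_flags, no_nodes * sizeof(bool), cudaMemcpyHostToDevice);

    int blocks = (no_nodes + threads_per_block - 1) / threads_per_block;
    Kernel<<<blocks, threads_per_block>>>(d_over, d_flags, no_nodes, d_writers);

    int writers = 0;
    // These blocking copies also wait for the kernel to finish.
    cudaMemcpy(&writers, d_writers, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_over, d_over, sizeof(bool), cudaMemcpyDeviceToHost);
    cudaError_t err = cudaGetLastError();

    cudaFree(d_over);
    cudaFree(d_flags);
    cudaFree(d_writers);
    return (err == cudaSuccess) ? writers : -1;
}
```

With 1024 nodes, 256 threads per block, and flags set only at indices 5, 7 and 300, two blocks contain flagged threads, so the helper should report two writes. If a single writer across the whole grid is wanted rather than one per block, the same test could be done on an int in global memory, at a higher cost per atomic.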
It is just an idea. I’ve not tested it, but it comes close to an operation performed by a single thread that is not necessarily the first thread of the block.