I want to use __syncthreads() inside a recursive function, like this:
    __device__ void foo(int k) {
        if (some_condition) {
            for (int i = 0; i < 8; i++) {
                foo(i + k); // foo might take longer with some inputs
                __syncthreads();
            }
        }
    }
How does this __syncthreads() apply now? I know it only synchronizes threads within a block. As far as I understand it, it applies to all threads in the block regardless of recursion depth? But what if I wanted this __syncthreads() to apply only at a certain recursion depth? Is that even possible? I could check the recursion depth, but I believe that won’t work either.
Are there possible alternatives?
I’ve seen that there are three __syncthreads() variants for devices of compute capability 2.0 and higher:
int __syncthreads_count(int predicate);
int __syncthreads_and(int predicate);
int __syncthreads_or(int predicate);
But I don’t think they will help since they seem like an atomic counter.
As you know, __syncthreads() is only safe where all threads within a block reach the barrier. This means that if you call __syncthreads() from within a conditional, the condition must evaluate to the same value on all threads within the block.
For __syncthreads() within recursion, this means that all threads within a block must execute the recursion to the same depth; otherwise not all threads will reach the same barrier.
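One way to get a uniform depth when each thread has its own condition is to let the block vote on it. The following is only a minimal sketch, not a drop-in solution: `wants_more`, `my_limit`, `visits`, and `run_demo` are hypothetical names made up for illustration, and it assumes a single block of 32 threads and a device that supports recursion (compute capability 2.0 or higher). Each thread evaluates its own predicate, `__syncthreads_or()` turns that into one block-wide decision, and every thread then recurses or returns together, so all of them reach the same barriers.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical per-thread condition: this thread has work while depth < my_limit.
__device__ bool wants_more(int depth, int my_limit) {
    return depth < my_limit;
}

__device__ void foo(int depth, int my_limit, int *visits) {
    // Block-wide vote; this call is itself a barrier, so every thread must reach it.
    // The result is nonzero if ANY thread in the block still wants to go deeper.
    int any = __syncthreads_or(wants_more(depth, my_limit));
    if (!any) return;  // same value on all threads, so all return together

    // Divergent per-thread work is fine as long as it contains no barrier.
    if (wants_more(depth, my_limit)) atomicAdd(visits, 1);

    foo(depth + 1, my_limit, visits);
    __syncthreads();   // reached by every thread at every depth
}

__global__ void kernel(int *visits) {
    int my_limit = threadIdx.x % 4;  // threads want different depths: 0..3
    foo(0, my_limit, visits);
}

// Launches one block of 32 threads and returns the number of per-thread visits.
int run_demo() {
    int h_visits = 0;
    int *d_visits = nullptr;
    cudaMalloc(&d_visits, sizeof(int));
    cudaMemcpy(d_visits, &h_visits, sizeof(int), cudaMemcpyHostToDevice);
    kernel<<<1, 32>>>(d_visits);
    cudaMemcpy(&h_visits, d_visits, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_visits);
    return h_visits;
}

int main() {
    printf("visits = %d\n", run_demo());
    return 0;
}
```

Threads that have no work left at a given depth still take part in the call and in the barrier; they just skip the work. The cost is that the whole block recurses as deep as its deepest thread.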