I have a CUDA kernel where every thread traverse a tree. Because of this I have a while loop which is looped until the thread reaches a leaf. In every step down the tree it checks which of the children it should pick to follow.
The code is as follows:
__global__ void search(float* centroids, float* features, int featureCount, int *votes)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
if(tid < featureCount)
{
int index = 0;
while (index < N)
{
votes[tid] = index;
int childIndex = index * CHILDREN + 1;
float minValue = FLT_MAX;
if(childIndex >= (N-CHILDREN)) break;
for(int i = 0; i < CHILDREN; i++)
{
int centroidIndex = childIndex + i;
float value = distance(centroids, features, centroidIndex, tid);
if(value < minValue)
{
minValue = value;
index = childIndex + i;
}
}
}
tid += blockDim.x * gridDim.x;
}
}
__device__ float distance(float* a, float* b, int aIndex, int bIndex)
{
float sum = 0.0f;
for(int i = 0; i < FEATURESIZE; i++)
{
float val = a[aIndex + i] - b[bIndex + i];
sum += val * val;
}
return sum;
}
This code goes into an infinite loop. That is what I find weird.
If I change the distance method to return a constant it works(ie. traversing left in the tree).
Have I missed something with loops in CUDA or is there just some hidden bug I can’t see? Because I don’t see how the code can go into an infinite loop.
Loops in CUDA C++ have the same semantics as they do in C++, so there must be a bug somewhere in your code. One strategy for debugging it would be to do so on the host.
First, because your code is scalar (e.g., it contains no calls to
__syncthreads), you can refactor it into__host__ __device__functions.distancecontains no CUDA-specific identifiers or functions, so you can simply prepend__host__:To refactor your
searchfunction, hoisttid(which depends on the CUDA-specific identifiersthreadIndexet al.) outside of it into a parameter, and make it a__host__ __device__function:Now write a
__global__function which does nothing except calculatetidand callsearch:Because
searchis now__host__ __device__, you can debug it by calling it from the CPU, emulating what a kernel launch would do:It should hang on the host exactly as it would on the device. Stick a
printfinside to find out where. Of course, you need to be sure to make host-side copies of your arrays such ascentroids, because the host cannot dereference pointers to device memory.Even though
printfis available to use from__device__functions with newer hardware, the reason you might prefer this approach is that calls toprintffrom a kernel do not commit until after the kernel retires. If the kernel never retires (as it apparently does not in your case) then your debugging output will never appear on the screen.