The Archive Base

Asked by Editorial Team on May 27, 2026 (2026-05-27T03:26:56+00:00)

I have a CUDA kernel in which every thread traverses a tree. Each thread therefore runs a while loop until it reaches a leaf. At every step down the tree it checks which child it should follow.

The code is as follows:

__global__ void search(float* centroids, float* features, int featureCount, int *votes)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if(tid < featureCount)
    {
        int index = 0;
        while (index < N) 
        {
            votes[tid] = index;
            int childIndex = index * CHILDREN + 1;
            float minValue = FLT_MAX;

            if(childIndex >= (N-CHILDREN)) break;

            for(int i = 0; i < CHILDREN; i++)
            {
                int centroidIndex = childIndex + i;
                float value = distance(centroids, features, centroidIndex, tid);
                if(value < minValue)
                {
                    minValue = value;
                    index = childIndex + i;
                }
            }
        }
        tid += blockDim.x * gridDim.x;
    }
}

__device__ float distance(float* a, float* b, int aIndex, int bIndex)
{
    float sum = 0.0f;
    for(int i = 0; i < FEATURESIZE; i++)
    {
        float val = a[aIndex + i] - b[bIndex + i];
        sum += val * val;
    }

    return sum;
}

This code goes into an infinite loop, which I find weird.
If I change the distance function to return a constant, it works (i.e., it always traverses left in the tree).

Have I missed something with loops in CUDA or is there just some hidden bug I can’t see? Because I don’t see how the code can go into an infinite loop.

1 Answer

  1. Editorial Team
    Added an answer on May 27, 2026 at 3:26 am

    Loops in CUDA C++ have the same semantics as they do in C++, so there must be a bug somewhere in your code. One strategy for debugging it would be to do so on the host.

    First, because your code is scalar (e.g., it contains no calls to __syncthreads), you can refactor it into __host__ __device__ functions.

    distance contains no CUDA-specific identifiers or functions, so you can simply prepend __host__:

    __host__ __device__ float distance(float* a, float* b, int aIndex, int bIndex);
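
If the host-side build goes through a plain C++ compiler rather than nvcc, the qualifiers themselves will not compile. Below is a minimal sketch of one common workaround. It assumes that nvcc defines __CUDACC__ when compiling CUDA sources and that the host build does not include cuda_runtime.h (which supplies its own definitions of these macros). FEATURESIZE = 4 is a placeholder value, and the const qualifiers and the assert are additions that are not in the question's code.

```cpp
#include <cassert>

// A plain C++ compiler does not define __CUDACC__, so make the CUDA
// execution-space qualifiers expand to nothing in that case.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Placeholder value: the question does not say what FEATURESIZE is.
#ifndef FEATURESIZE
#define FEATURESIZE 4
#endif

// Same arithmetic and indexing as the question's distance, now callable
// from both host and device code.
__host__ __device__ float distance(const float* a, const float* b, int aIndex, int bIndex)
{
    assert(a != nullptr && b != nullptr);  // added sanity check

    float sum = 0.0f;
    for (int i = 0; i < FEATURESIZE; i++)
    {
        float val = a[aIndex + i] - b[bIndex + i];
        sum += val * val;
    }
    return sum;
}
```

With this guard in place the same source file can be built by nvcc for the device and by an ordinary host compiler for debugging.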
    

    To refactor your search function, hoist tid (which depends on the CUDA-specific identifiers threadIdx et al.) outside of it into a parameter, and make it a __host__ __device__ function:

    __host__ __device__ void search(int tid, float* centroids, float* features, int featureCount, int *votes)
    {
      if(tid < featureCount)
      {
        int index = 0;
        while (index < N) 
        {
          votes[tid] = index;
          int childIndex = index * CHILDREN + 1;
          float minValue = FLT_MAX;
    
          if(childIndex >= (N-CHILDREN)) break;
    
          for(int i = 0; i < CHILDREN; i++)
          {
            int centroidIndex = childIndex + i;
            float value = distance(centroids, features, centroidIndex, tid);
            if(value < minValue)
            {
              minValue = value;
              index = childIndex + i;
            }
          }
        }
      }
    }
    

    Now write a __global__ function which does nothing except calculate tid and call search:

    __global__ void search_kernel(float *centroids, float *features, int featureCount, int *votes)
    {
      int tid = threadIdx.x + blockIdx.x * blockDim.x;
      search(tid, centroids, features, featureCount, votes); 
    }
    

    Because search is now __host__ __device__, you can debug it by calling it from the CPU, emulating what a kernel launch would do:

    for(int tid = 0; tid < featureCount; ++tid)
    {
      search(tid, centroids, features, featureCount, votes);
    }
    

    It should hang on the host exactly as it would on the device. Stick a printf inside to find out where. Of course, you need to be sure to make host-side copies of your arrays such as centroids, because the host cannot dereference pointers to device memory.

    Even though printf is available to use from __device__ functions with newer hardware, the reason you might prefer this approach is that calls to printf from a kernel do not commit until after the kernel retires. If the kernel never retires (as it apparently does not in your case) then your debugging output will never appear on the screen.
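
As an illustration of where such a printf could go, here is a hedged sketch of a host-testable variant of search that reports a stall instead of spinning. The name search_checked, the bool return value, and the tiny sizes (CHILDREN = 2, N = 15, FEATURESIZE = 2) are all hypothetical; the indexing is copied unchanged from the question. It relies on one fact that may or may not be the cause in your case: `value < minValue` is false for every child when distance returns NaN (or anything not below FLT_MAX), and then index never changes.

```cpp
#include <cassert>
#include <cfloat>
#include <cstdio>

#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical sizes for a tiny binary tree; the question does not give the real ones.
constexpr int CHILDREN = 2;
constexpr int N = 15;
constexpr int FEATURESIZE = 2;

// Same arithmetic and indexing as the question's distance.
__host__ __device__ float distance(const float* a, const float* b, int aIndex, int bIndex)
{
  float sum = 0.0f;
  for (int i = 0; i < FEATURESIZE; i++)
  {
    float val = a[aIndex + i] - b[bIndex + i];
    sum += val * val;
  }
  return sum;
}

// Variant of search that returns false instead of looping forever
// when no child is selected at some node.
__host__ __device__ bool search_checked(int tid, const float* centroids, const float* features,
                                        int featureCount, int* votes)
{
  assert(tid >= 0);
  if (tid >= featureCount) return true;  // nothing to do for this thread

  int index = 0;
  while (index < N)
  {
    votes[tid] = index;
    int childIndex = index * CHILDREN + 1;
    float minValue = FLT_MAX;

    if (childIndex >= (N - CHILDREN)) break;

    int next = index;
    for (int i = 0; i < CHILDREN; i++)
    {
      float value = distance(centroids, features, childIndex + i, tid);
      if (value < minValue)  // false for NaN and for values >= FLT_MAX
      {
        minValue = value;
        next = childIndex + i;
      }
    }

    if (next == index)
    {
      // No child won the comparison; the original loop would repeat this step forever.
      printf("tid %d made no progress at node %d\n", tid, index);
      return false;
    }
    index = next;
  }
  return true;
}
```

Calling search_checked from the host loop in place of search, and stopping at the first false, points at the thread and node where the traversal stops advancing. If it never returns false on the host copies of your data, the stall hypothesis above does not apply and the printf needs to move elsewhere.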

© 2021 The Archive Base. All Rights Reserved