Sorry for bad title. I could not come up with anything better. Every example

Question

Asked: May 26, 2026 (2026-05-26T09:23:16+00:00)

Sorry for bad title. I could not come up with anything better.

Every example of a CUDA program I have seen has predefined data that is ready to be parallelized.
A common example is the sum of two matrices where both matrices are already filled. But what about programs that generate new tasks? How do I model this in CUDA? How do I pass a result on so that other threads can begin working on it?

For example:
Say I run a kernel on one job. This job generates 10 new independent jobs. Each of them generates 10 new independent jobs, and so on. This seems like a highly parallel task because each job is independent. The problem is that I don’t know how to model this in CUDA.
I tried using a while loop in a kernel to keep polling until a thread could begin computation. Each thread was assigned a job. But that did not work; the kernel seemed to ignore the while loop.

Code example:

On host:
fill ready array with 0
ready[0] = 1;

On device:
__global__ void kernel(int *ready, int *result)
{
    int tid = threadIdx.x;
    if(tid < N)
    {
        int condition = ready[tid];
        while(condition != 1)
        {
            condition = ready[tid];
        }

        result[tid] = 3; // placeholder; the real computation comes later

        // the child jobs are now ready to work
        int childIndex = tid * 10;
        if(childIndex < (N-10))
        {
            ready[childIndex + 1] = 1; ready[childIndex + 2] = 1;
            ready[childIndex + 3] = 1; ready[childIndex + 4] = 1;
            ready[childIndex + 5] = 1; ready[childIndex + 6] = 1;
            ready[childIndex + 7] = 1; ready[childIndex + 8] = 1;
            ready[childIndex + 9] = 1; ready[childIndex +10] = 1;
        }
    }
}

1 Answer

Editorial Team · Answer 1 · 2026-05-26T09:23:16+00:00

You will want to use multiple kernel calls. Once a kernel has finished and generated the work units for its children, the children can be executed in another kernel launch. You don’t want to poll with a while loop inside a CUDA kernel anyway: CUDA does not guarantee that the thread you are waiting on has been scheduled yet, so a spin-wait can hang, and even if it worked you would get terrible performance.
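As a rough illustration of that idea, here is a minimal sketch in which each launch consumes a queue of ready jobs and appends the children it generates to a second queue, which the next launch then consumes. The names `expand`, `run_queue`, and `CHILDREN_PER_JOB` are made up for this example, the child numbering (job i spawns jobs 10*i + 1 to 10*i + 10) is taken from the question, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <utility>
#include <vector>

constexpr int CHILDREN_PER_JOB = 10;  // hypothetical name; fan-out from the question

// One thread per ready job. Each job does its work, then appends its
// children to the output queue for the next launch.
__global__ void expand(const int *in, int in_count, int *out, int *out_count,
                       int *result, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= in_count) return;

    int job = in[i];
    result[job] = 3;  // placeholder computation, as in the question

    for (int c = 1; c <= CHILDREN_PER_JOB; ++c) {
        long long child = (long long)job * CHILDREN_PER_JOB + c;
        if (child < n) {
            int slot = atomicAdd(out_count, 1);  // reserve a unique queue slot
            out[slot] = (int)child;
        }
    }
}

// Hypothetical host driver: processes jobs 0..n-1, returns how many ran.
int run_queue(int n, std::vector<int> &result)
{
    result.assign(n > 0 ? n : 0, 0);
    if (n <= 0) return 0;

    int *q_in, *q_out, *d_count, *d_result;
    cudaMalloc(&q_in, n * sizeof(int));   // every job index is queued at most once,
    cudaMalloc(&q_out, n * sizeof(int));  // so n slots per queue are enough
    cudaMalloc(&d_count, sizeof(int));
    cudaMalloc(&d_result, n * sizeof(int));
    cudaMemset(d_result, 0, n * sizeof(int));

    int root = 0;
    cudaMemcpy(q_in, &root, sizeof(int), cudaMemcpyHostToDevice);

    int in_count = 1, processed = 0;
    while (in_count > 0) {
        cudaMemset(d_count, 0, sizeof(int));
        int threads = 256;
        int blocks = (in_count + threads - 1) / threads;
        expand<<<blocks, threads>>>(q_in, in_count, q_out, d_count, d_result, n);
        processed += in_count;
        // Blocking copy: waits for the kernel, then reads the new queue length.
        cudaMemcpy(&in_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
        std::swap(q_in, q_out);  // this launch's output is the next launch's input
    }

    cudaMemcpy(result.data(), d_result, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(q_in); cudaFree(q_out); cudaFree(d_count); cudaFree(d_result);
    return processed;
}
```

The queue is only needed when the number of children depends on the data. The cost of this design is one host round trip per generation of jobs, which is usually acceptable when each generation is large.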

I would look up the CUDA parallel reduction example. It shows how to decompose a problem into multiple kernel launches. The only difference is that instead of doing less work with each successive launch, you will be doing more.
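For the regular tree in the question, a sketch of that decomposition might look like the following: one launch per level of the tree, with level d holding 10^d jobs, so the work grows with each launch instead of shrinking. The names `process_level`, `run_levels`, and `FANOUT` are invented for this example, and the "computation" (each job stores its depth, derived from its parent's result) is only a stand-in to show a result being passed from one launch to the next through global memory. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

constexpr int FANOUT = 10;  // hypothetical name; each job spawns 10 children

// Processes one level of the tree: jobs [first, first + count).
// Job i has children FANOUT*i + 1 .. FANOUT*i + FANOUT, so its parent is (i - 1) / FANOUT.
__global__ void process_level(int *result, int first, int count)
{
    int offset = blockIdx.x * blockDim.x + threadIdx.x;
    if (offset >= count) return;

    int job = first + offset;
    // Stand-in computation: the parent's result was written by the previous
    // launch, so it is safe to read here.
    result[job] = (job == 0) ? 0 : result[(job - 1) / FANOUT] + 1;
}

// Hypothetical host driver: returns the result for jobs 0..n-1.
std::vector<int> run_levels(int n)
{
    std::vector<int> host(n > 0 ? n : 0, -1);
    if (n <= 0) return host;

    int *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(int));
    cudaMemset(dev, 0xFF, n * sizeof(int));  // every int becomes -1

    long long first = 0, count = 1;  // level 0 is the single root job
    while (first < n) {
        int todo = (int)std::min<long long>(count, n - first);
        int threads = 256;
        int blocks = (todo + threads - 1) / threads;
        // Launches on the same stream run in order, so each level sees
        // the finished results of the level before it.
        process_level<<<blocks, threads>>>(dev, (int)first, todo);
        first += count;
        count *= FANOUT;
    }

    // Blocking copy: waits for all launches above to finish.
    cudaMemcpy(host.data(), dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

Because the level boundaries are known on the host in this case, no polling and no device-side flags are needed; the ordering between launches does the synchronization.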
