Sorry for the bad title; I could not come up with anything better.
Every example of a CUDA program I have seen has predefined data that is ready to be parallelized.
A common example is the sum of two matrices, where both matrices are already filled. But what about programs that generate new tasks? How do I model this in CUDA? How do I pass a result along so that other threads can begin working on it?
For example:
Say I run a kernel on one job. This job generates 10 new independent jobs. Each of them generates 10 new independent jobs, and so on. This seems like a highly parallel task because each job is independent. The problem is that I don’t know how to model this in CUDA.
I tried using a while loop in a kernel to keep polling until a thread could begin its computation. Each thread was assigned one job. But that did not work: the kernel seemed to ignore the while loop.
Code example:
On host:
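// ready is assumed here to be a device array of N ints
cudaMemset(ready, 0, N * sizeof(int)); // fill ready array with 0
int one = 1;
cudaMemcpy(ready, &one, sizeof(int), cudaMemcpyHostToDevice); // ready[0] = 1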
On device:
__global__ void kernel(int *ready, int *result)
{
    int tid = threadIdx.x;
    if (tid < N)
    {
        // poll until this thread's job has been marked ready
        int condition = ready[tid];
        while (condition != 1)
        {
            condition = ready[tid];
        }
        result[tid] = 3; // later do real computation
        // the child jobs are now ready to work
        int childIndex = tid * 10;
        if (childIndex < (N - 10))
        {
            for (int k = 1; k <= 10; ++k)
            {
                ready[childIndex + k] = 1;
            }
        }
    }
}
You will want to use multiple kernel calls. Once a kernel has finished and generated the work units for its children, the children can be executed in another kernel launch. You don’t want to poll with a while loop inside a CUDA kernel anyway; even if it worked, you would get terrible performance.
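As a rough sketch of the multiple-kernel idea, assuming the job numbering from the question (job j has children 10*j + 1 through 10*j + 10), the host can launch one kernel per level of the job tree. The names processLevel and runAllLevels are made up for this illustration, and the "computation" is only a stand-in: each job reads its parent's result and stores its own depth, to show that the parent's output is available by the time the child runs. Kernels launched into the same stream run in order, so no polling is needed.

```cuda
#include <algorithm>
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Process one level of the job tree: jobs [first, first + count).
// Placeholder work (an assumption for illustration): store the job's depth,
// computed from the parent's already finished result.
__global__ void processLevel(int *result, int first, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
    {
        int job = first + i;
        if (job == 0)
        {
            result[job] = 0;                  // root job has no parent
        }
        else
        {
            int parent = (job - 1) / 10;      // inverse of child = 10 * parent + k
            result[job] = result[parent] + 1; // parent ran in an earlier launch
        }
    }
}

// Hypothetical host driver: one kernel launch per level, for jobs 0 .. n-1.
std::vector<int> runAllLevels(int n)
{
    assert(n > 0);
    int *dResult = nullptr;
    cudaMalloc(&dResult, n * sizeof(int));
    cudaMemset(dResult, 0, n * sizeof(int));

    const int threads = 256;
    long long first = 0;  // index of the first job in the current level
    long long count = 1;  // number of jobs in the current level
    while (first < n)
    {
        // The last level may be cut short by the end of the array.
        int active = (int)std::min<long long>(count, n - first);
        int blocks = (active + threads - 1) / threads;
        // Launches in the default stream execute in order, so this level
        // starts only after the previous level has finished.
        processLevel<<<blocks, threads>>>(dResult, (int)first, active);
        first += count;
        count *= 10;
    }

    std::vector<int> hResult(n);
    // A blocking cudaMemcpy also waits for the preceding kernels.
    cudaMemcpy(hResult.data(), dResult, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dResult);
    return hResult;
}
```

Calling runAllLevels(1000) from host code would issue four launches, covering 1, 10, 100, and then the remaining 889 jobs. Error checking on the CUDA calls is left out to keep the sketch short.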
I would google the CUDA parallel reduction example. It shows how to decompose a problem into multiple kernel launches. The only difference is that instead of doing less work in each successive kernel, you will be doing more.
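For reference, the multi-launch pattern in a reduction looks roughly like the following. This is a simplified sketch and not the code from the NVIDIA sample; sumBlocks, reduceSum, and kThreads are names invented here. Each launch reduces every block of inputs to one partial sum, and the host keeps launching on the shrinking array of partial sums until a single value remains.

```cuda
#include <utility>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256; // threads per block; must be a power of two

// Each block sums up to kThreads inputs and writes one partial sum.
__global__ void sumBlocks(const int *in, int *out, int n)
{
    __shared__ int cache[kThreads];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    cache[threadIdx.x] = (i < n) ? in[i] : 0;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (threadIdx.x < stride)
        {
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0)
    {
        out[blockIdx.x] = cache[0];
    }
}

// Hypothetical host driver: relaunch until one partial sum is left.
int reduceSum(const std::vector<int> &values)
{
    int n = (int)values.size();
    if (n == 0)
    {
        return 0;
    }
    int blocks = (n + kThreads - 1) / kThreads;
    int *dIn = nullptr;
    int *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dOut, blocks * sizeof(int));
    cudaMemcpy(dIn, values.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    while (true)
    {
        blocks = (n + kThreads - 1) / kThreads;
        sumBlocks<<<blocks, kThreads>>>(dIn, dOut, n);
        if (blocks == 1)
        {
            break;            // dOut[0] now holds the total
        }
        std::swap(dIn, dOut); // partial sums become the next input
        n = blocks;           // less work in every following launch
    }

    int total = 0;
    cudaMemcpy(&total, dOut, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return total;
}
```

Here the amount of work shrinks with every launch. In the job-tree case the host loop has the same shape, except that each launch covers more jobs than the one before it.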