I have an algorithm which consists of two major tasks. Both tasks are embarrassingly parallel, so I can port this algorithm to CUDA in one of the following ways.
Kernel1<<<Block,Threads>>>(); // For task 1
cudaThreadSynchronize();
Kernel2<<<Block,Threads>>>(); // For task 2
Or I can do the following:
Kernel<<<Block,Threads>>>()
{
1. Threads work on task 1.
2. Synchronize across the device.
3. Start task 2.
}
Note that in the first method we have to return to the CPU between the two tasks, while in the second method we have to synchronize across all blocks inside the kernel. A paper in IPDPS 10 says that the second method, with proper care, can perform better. But in general, which method should be followed?
There is not currently any officially supported method for synchronizing across thread blocks within a single kernel execution in the CUDA programming model. Methods of doing so, in my experience, lead to brittle code that can behave incorrectly under changing circumstances such as running on different hardware or changing driver and CUDA release versions.
Just because something is published in an academic publication does not mean it is a safe idea for production code.
I recommend you stick with your method 1, and I ask you this: have you determined that separating your computation into two separate kernels is really causing a performance problem? Is the cost of a second kernel launch definitely the bottleneck?
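One way to check whether the second launch matters is to time both kernels together with CUDA events. The following is only a minimal sketch of method 1 under stated assumptions: `task1` and `task2` are hypothetical stand-ins for your two tasks, the problem size and launch configuration are arbitrary, and error checking is omitted for brevity. It relies on the fact that kernels launched into the same stream execute in order, so no host-side synchronization is needed between the two launches.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for task 1: each thread handles one element.
__global__ void task1(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f;
}

// Hypothetical stand-in for task 2: consumes the results of task 1.
__global__ void task2(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] + 1.0f;
}

int main() {
    const int n = 1 << 20;                          // assumed problem size
    const int threads = 256;                        // assumed block size
    const int blocks = (n + threads - 1) / threads; // enough blocks to cover n

    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    task1<<<blocks, threads>>>(d_data, n);
    // Both launches go into the default stream, so task2 does not begin
    // until every block of task1 has finished.
    task2<<<blocks, threads>>>(d_data, n);
    cudaEventRecord(stop);

    // Block the host until the stop event (and both kernels) have completed.
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("task1 + task2: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

If the measured total is close to the sum of the two kernels timed individually, the extra launch is unlikely to be your bottleneck. Also note that `cudaThreadSynchronize()` is deprecated in favor of `cudaDeviceSynchronize()`, and neither is required between the two launches just to enforce ordering.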