A follow up Q from: CUDA: Calling a __device__ function from a kernel
I’m trying to speed up a sort operation. A simplified pseudo version follows:
// some costly swap operation
__device__ swap(float* ptrA, float* ptrB){
float saveData; // swap some
saveData= *Adata; // big complex
*Adata= *Bdata // data chunk
*Bdata= saveData;
}
// a rather simple sort operation
__global__ sort(float data[]){
for (i=0; i<limit: i++){
find left swap point
find right swap point
swap<<<1,1>>>(left, right);
}
}
(Note: This simple version doesn’t show the reduction techniques in the blocks.)
The idea is that it is easy (fast) to identify the swap points. The swap operation is costly (slow). So use one block to find/identify the swap points. Use other blocks to do the swap operations. i.e. Do the actual swapping in parallel.
This sounds like a decent plan. But if the compiler in-lines the device calls, then there is no parallel swapping taking place.
Is there a way to tell the compiler to NOT in-line a device call?
Edit (2016):
Dynamic parallelism was introduced in the second generation of Kepler architecture GPUs. Launching kernels in the device is supported on compute capability 3.5 and higher devices.
Original Answer:
You will have to wait until the end of the year when the next generation of hardware is available. No current CUDA devices can launch kernels from other kernels – it is presently unsupported.