I am working on code that needs to be time-efficient, so I am using cuFFT. But when I compute the FFT of a very large data set in parallel, it is slower than FFTW on the CPU. After timing every line of code with a high-precision timer, I found that cudaMalloc takes around 0.983 s, while each of the remaining lines takes around 0.00xx s, as expected.
I have gone through some of the related posts, but according to them:
the main delay with GPUs is due to memory transfer not memory allocation
One of the posts also said:
The very first call to any of the cuda library functions launches an initialisation subroutine
What is the actual reason for this delay? Or is a delay like this not normal?
Thanks in advance.
Is it possible that the large delay you are seeing (nearly 1 s) is due to driver initialisation? It seems rather long for a cudaMalloc. Also check that your driver is up to date.
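One way to check whether the delay is one-time initialisation rather than the allocation itself is to time two consecutive cudaMalloc calls. The following is only a minimal sketch: the helper name elapsed_ms and the 64 MiB buffer size are assumptions made for illustration, not taken from the original code.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: wall-clock time of a callable, in milliseconds.
template <typename F>
double elapsed_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const size_t bytes = size_t{64} << 20;  // 64 MiB, arbitrary size for illustration
    void* a = nullptr;
    void* b = nullptr;
    cudaError_t err_a = cudaSuccess;
    cudaError_t err_b = cudaSuccess;

    // The first CUDA runtime call also pays for any one-time initialisation.
    double first = elapsed_ms([&] { err_a = cudaMalloc(&a, bytes); });
    // The second call should reflect the cost of the allocation alone.
    double second = elapsed_ms([&] { err_b = cudaMalloc(&b, bytes); });

    if (err_a != cudaSuccess || err_b != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed: %s\n",
                     cudaGetErrorString(err_a != cudaSuccess ? err_a : err_b));
        return 1;
    }

    std::printf("first cudaMalloc:  %.3f ms\n", first);
    std::printf("second cudaMalloc: %.3f ms\n", second);

    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

If the second call is far cheaper than the first, the time is most likely going into initialisation and not into the allocation.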
The delay for the first kernel launch can be due to a number of factors:
1. Loading the driver
2. Compiling the PTX for the device architecture
3. Creating the context
The first of these is only applicable if you are running on a Linux system without X. In that case the driver is only loaded when required and unloaded afterwards. Running nvidia-smi -pm 1 as root will run the driver in persistent mode to avoid such delays; check out man nvidia-smi for details, and remember to add this to an init script since it won't persist across a reboot.
The second delay is in compiling the PTX for the specific device architecture in your system. This is easily avoided by embedding the binary for your device architecture (or architectures if you want to support multiple archs without compiling PTX) into the executable. See the CUDA C Programming Guide (available on the NVIDIA website) for more information; section 3.1.1.2 talks about JIT compilation.
The third point, context creation, is unavoidable but NVIDIA have gone to great effort to reduce the cost. Context creation involves copying the executable code to the device, copying any data objects, setting up the memory system etc.
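Since context creation cannot be avoided, a common approach is to pay for it once at program start, outside any timed region. A minimal sketch, assuming device 0 and a hypothetical helper named warm_up_device; calling cudaFree on a null pointer is a widely used idiom for triggering initialisation, not something the posts above prescribe:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: select a device and force the runtime to initialise.
// cudaFree(nullptr) frees nothing, but as a runtime call it is commonly
// used to make context creation happen here instead of later.
cudaError_t warm_up_device(int device) {
    cudaError_t err = cudaSetDevice(device);
    if (err != cudaSuccess) {
        return err;
    }
    return cudaFree(nullptr);
}

int main() {
    // Assumption: device 0 is the GPU the rest of the program uses.
    cudaError_t err = warm_up_device(0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "warm-up failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Start timing only from here, so the one-time cost is excluded.
    void* buffer = nullptr;
    err = cudaMalloc(&buffer, size_t{1} << 20);  // 1 MiB, arbitrary size
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaFree(buffer);
    return 0;
}
```

This does not remove the cost; it only moves it to start-up, which helps when the program does many transforms after initialising once.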