I’m looking at the FFT example in the CUDA SDK and I’m wondering: why is CUFFT much faster when half of the padded data size is a power of two? (Half, because in the frequency domain half of the output is redundant.)
What’s the point of having a power-of-two size to work on?
I think this thread answers your question: CUFFT uses different algorithms depending on the transform size.
http://forums.nvidia.com/index.php?showtopic=195094
From the manual: http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/CUFFT_Library_3.1.pdf
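If it helps, here is a minimal host-side sketch of how one might pick a power-of-two padded size before creating a plan. The function names `next_pow2` and `padded_fft_size` are made up for illustration and are not taken from the SDK sample or from CUFFT. The `data_len + kernel_len - 1` rule assumes you are padding for a linear convolution.

```c
#include <stddef.h>

/* Round n up to the nearest power of two (returns 1 for n == 0).
 * Assumption: n is small enough that the result fits in size_t. */
size_t next_pow2(size_t n)
{
    size_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* Hypothetical helper: padded length for a linear convolution of a
 * signal of data_len samples with a kernel of kernel_len samples,
 * rounded up to a power of two. Assumes both lengths are >= 1. */
size_t padded_fft_size(size_t data_len, size_t kernel_len)
{
    return next_pow2(data_len + kernel_len - 1);
}
```

For example, a 1000-sample signal with a 25-tap kernel needs at least 1024 samples, which is already a power of two, while 1024 samples with a 7-tap kernel needs 1030 and gets rounded up to 2048.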