I’m looking at the FFT example on the CUDA SDK and I’m wondering: why

Question

0

Asked: May 20, 20262026-05-20T23:42:46+00:00 2026-05-20T23:42:46+00:00

I’m looking at the FFT example on the CUDA SDK and I’m wondering: why

0

I’m looking at the FFT example on the CUDA SDK and I’m wondering: why the CUFFT is much faster when the half of the padded data is a power of two? (half because in frequency domain half is redundant)

What’s the point in having a power of two size to work on?

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-20T23:42:46+00:00

I think this is your answer. It’s using different algorithms

http://forums.nvidia.com/index.php?showtopic=195094

“I have been working on a similar
problem. In the cuFFT manual, it is
explained that cuFFT uses two
different algorithms for implementing
the FFTs. One is the Cooley-Tuckey
method and the other is the Bluestein
algorithm. When the dimensions have
prime factors of only 2,3,5 and 7 e.g
(675 = 3^3 x 5^5), then 675 x 675
performs much much better than say 674
x 674 or 677 x 677. This is done using
the Cooley-Tuckey method. If one of
the prime factors is a prime other
than 2,3,5 or 7, then the FFT for that
number is implemented using the
Bluestein method. The Bluestein method
is slower and there is also some
precision loss. “

From the manual: http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/CUFFT_Library_3.1.pdf

The CUFFT library implements several
FFT algorithms, each having different
performance and accuracy. The best
performance paths correspond to
transform sizes that meet two
criteria:

Fit in CUDAʹs shared
memory

Are powers of a single factor
(for example, powers of two)

These
transforms are also the most accurate
due to the numeric stability of the
chosen FFT algorithm. For transform
sizes that meet the first criterion
but not second, CUFFT uses a more
general mixed‐radix FFT algorithm that
is usually slower and less numerically
accurate. Therefore, if possible it is
best to use sizes that are powers of
two or four, or powers of other small
primes (such as, three, five, or
seven). In addition, the power‐of‐two
FFT algorithm in CUFFT makes maximum
use of shared memory by blocking
sub‐transforms for signals that do not
meet the first criterion.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I’m looking at the FFT example on the CUDA SDK and I’m wondering: why

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply