How well does NVCC optimize device code? Does it perform optimizations such as constant folding and common subexpression elimination?
For example, will it reduce the following:
float a = 1 / sqrtf(2 * M_PI);
float b = c / sqrtf(2 * M_PI);
to this:
float sqrt_2pi = sqrtf(2 * M_PI); // Compile time constant
float a = 1 / sqrt_2pi;
float b = c / sqrt_2pi;
What about cleverer optimizations that rely on knowing the semantics of math functions, such as reducing this:
float a = 1 / sqrtf(c * M_PI);
float b = c / sqrtf(M_PI);
to this:
float sqrt_pi = sqrtf(M_PI); // Compile time constant
float a = 1 / (sqrt_pi * sqrtf(c));
float b = c / sqrt_pi;
The compiler is way ahead of you. In your example:
nvopencc (Open64) will emit this:
which is equivalent to
The second case gets compiled to this:
I am guessing the expression for a generated by the compiler should be more accurate than your “optimized” version, but run at about the same speed.