I’m wondering whether a fast implementation of pow(), for example this one, is a faster way to get the square root of an integer than a fast sqrt(x). We know that
sqrt(x) = pow(x, 0.5f)
I cannot test the speed myself because I did not find a fast implementation of sqrt.
My question is: is a fast implementation of pow(x, 0.5f) faster than a fast sqrt(x)?
Edit: I meant powf, the version of pow that takes floats instead of doubles. (Doubles are more misleading.)
With regard to the C standard library sqrt and pow, the answer is no.
First, if pow(x, .5f) were faster than an implementation of sqrt(x), the engineer assigned to maintain sqrt would replace the implementation with pow(x, .5f).
Second, implementations of sqrt in commercial libraries are typically optimized specifically to perform that task, often by people who are knowledgeable about writing high-performance software and who write in or near assembly language to get the best performance available from the processor.
Third, many processors have instructions to perform sqrt or to assist in calculating it. (Commonly, there is an instruction to provide an estimate of the reciprocal of the square root and an instruction to refine that estimate.)
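To illustrate the estimate-then-refine idea, here is a minimal sketch in portable C of a single Newton-Raphson refinement step for a reciprocal square root. The function name refine_rsqrt is hypothetical, and the initial estimate is passed in as a plain argument; on real hardware it would come from the processor's estimate instruction, which this sketch does not use.

```c
/* One Newton-Raphson step for y ~ 1/sqrt(x).
 * Given an estimate y, the refined estimate is
 *     y * (1.5 - 0.5 * x * y * y).
 * Each step roughly doubles the number of correct bits,
 * provided the starting estimate is reasonably close. */
float refine_rsqrt(float x, float y)
{
    return y * (1.5f - 0.5f * x * y * y);
}
```

For x = 3 and a rough starting estimate of 0.5, one step gives 0.5625 and a second step gives about 0.57678, approaching the true value 1/sqrt(3) ≈ 0.57735. Multiplying the refined estimate by x then yields an approximation of sqrt(x).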
However, the code you linked and the question you asked are about attempting a crude approximation of sqrt using a crudely approximated pow.
I converted the final version of the pow approximation routine referred to in the question to C and measured its run time when computing pow(3, .5). I also measured the run time of the system (Mac OS X 10.8) pow and sqrt and of the sqrt approximation here (with one iteration and multiplying by the argument at the end to get the square root, rather than its inverse).
First, the computed results: The pow approximation returns 1.72101. The sqrt approximation returns 1.73054. The correct value, returned by the system pow and sqrt, is 1.73205.
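The code of the square root approximation is not reproduced in this answer, so the following is only a sketch under an assumption: that it is the widely known bit-manipulation reciprocal square root with the constant 0x5f3759df, followed by one Newton-Raphson iteration and a final multiplication by the argument. The names approx_rsqrt and approx_sqrt are hypothetical. The sketch also assumes a 32-bit IEEE-754 float and a positive, finite, normal argument.

```c
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x) with an integer trick plus one Newton-Raphson step.
 * Assumes float is a 32-bit IEEE-754 value and x is positive and normal. */
float approx_rsqrt(float x)
{
    uint32_t bits;
    float y;

    memcpy(&bits, &x, sizeof bits);      /* reinterpret the float as an integer */
    bits = 0x5f3759dfu - (bits >> 1);    /* crude initial estimate of 1/sqrt(x) */
    memcpy(&y, &bits, sizeof y);         /* reinterpret back as a float */

    return y * (1.5f - 0.5f * x * y * y); /* one refinement iteration */
}

/* sqrt(x) = x * (1/sqrt(x)), so multiply by the argument at the end. */
float approx_sqrt(float x)
{
    return x * approx_rsqrt(x);
}
```

With these assumptions, approx_sqrt(3.0f) comes out at roughly 1.7305, which is consistent with the value reported for the sqrt approximation. Using memcpy rather than a pointer cast keeps the type punning well defined in C.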
Running in 64-bit mode on a MacPro4,1, the pow approximation takes about 6 cycles, the system pow takes 29 cycles, the square root approximation takes 10 cycles, and the system sqrt takes 29 cycles. These times may include some overhead for loading arguments and storing results (I used volatile variables to force the compiler not to optimize away otherwise useless loop iterations, so that I could measure them).
(These times are “effective throughput”, in effect the number of CPU cycles from when one call begins to when another can begin.)
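The measurement approach described above can be sketched as follows. This is not the harness that produced the numbers in this answer; it is a minimal illustration that uses volatile variables so the compiler cannot discard the loop. The names seconds_per_call and baseline are hypothetical, and the sketch reports CPU seconds per call via clock() rather than cycles, so converting to cycles would require multiplying by the processor's clock rate.

```c
#include <time.h>

/* A do-nothing candidate, useful for estimating the loop, load, store,
 * and call overhead that is included in every measurement. */
float baseline(float x)
{
    return x;
}

/* Average CPU time in seconds for one call of f, measured over n calls.
 * The volatile input forces a fresh load on every iteration, and the
 * volatile sink forces every result to be stored, so the compiler
 * cannot optimize the otherwise useless iterations away. */
double seconds_per_call(float (*f)(float), long n)
{
    volatile float input = 3.0f;
    volatile float sink = 0.0f;
    clock_t start, end;
    long i;

    start = clock();
    for (i = 0; i < n; ++i)
        sink = f(input);
    end = clock();

    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC / (double)n;
}
```

Calling seconds_per_call(baseline, n) and subtracting the result from the time measured for a real candidate such as sqrtf gives a rough idea of how much of a measurement is overhead. Note that calling through a function pointer prevents inlining, so this sketch probably includes more overhead than a hand-written loop around each function would.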