I have hot spots in my code where calls to pow() take up around 10-20% of my execution time.
My input to pow(x,y) is very specific, so I’m wondering if there’s a way to roll two pow() approximations (one for each exponent) with higher performance:
- I have two constant exponents: 2.4 and 1/2.4.
- When the exponent is 2.4, x will be in the range (0.090473935, 1.0].
- When the exponent is 1/2.4, x will be in the range (0.0031308, 1.0].
- I’m using SSE/AVX `float` vectors. If platform specifics can be taken advantage of, right on!
A maximum error rate around 0.01% is ideal, though I’m interested in full precision (for float) algorithms as well.
I’m already using a fast pow() approximation, but it doesn’t take these constraints into account. Is it possible to do better?
In the IEEE 754 hacking vein, here is another solution which is faster and less “magical.” It achieves an error margin of .08% in about a dozen clock cycles (for the case of p=2.4, on an Intel Merom CPU).
Floating point numbers were originally invented as an approximation to logarithms, so you can use the integer value as an approximation of `log2`. This is somewhat-portably achievable by applying the convert-from-integer instruction to a floating-point value, to obtain another floating-point value.
To complete the `pow` computation, you can multiply by a constant factor and convert the logarithm back with the convert-to-integer instruction. On SSE, the relevant instructions are `cvtdq2ps` and `cvtps2dq`.
It’s not quite so simple, though. The exponent field in IEEE 754 is signed, with a bias value of 127 representing an exponent of zero. This bias must be removed before you multiply the logarithm, and re-added before you exponentiate. Furthermore, bias adjustment by subtraction won’t work on zero. Fortunately, both adjustments can be achieved by multiplying by a constant factor beforehand.
`exp2( 127 / p - 127 )` is the constant factor. This function is rather specialized: it won’t work with small fractional exponents, because the constant factor grows exponentially with the inverse of the exponent and will overflow. It won’t work with negative exponents. Large exponents lead to high error, because the mantissa bits are mingled with the exponent bits by the multiplication.
But, it’s just 4 fast instructions long. Pre-multiply, convert from “integer” (to logarithm), power-multiply, convert to “integer” (from logarithm). Conversions are very fast on this implementation of SSE. We can also squeeze an extra constant coefficient into the first multiplication.
A few trials with exponent = 2.4 show this consistently overestimates by about 5%. (The routine is always guaranteed to overestimate.) You could simply multiply by 0.95, but a few more instructions will get us about 4 decimal digits of accuracy, which should be enough for graphics.
The key is to match the overestimate with an underestimate, and take the average.
- `rsqrtps`. (This is quite accurate enough, but does sacrifice the ability to work with zero.)
- `mulps`.
- `rsqrtps`.
- `mulps`.
- `mulps`.
- `mulps`. This is the overestimate.
- `mulps`. This is the underestimate.
- `addps`, one `mulps`.
Instruction tally: fourteen, including two conversions with latency = 5 and two reciprocal square root estimates with throughput = 4.
To properly take the average, we want to weight the estimates by their expected errors. The underestimate raises the error to a power of 0.6 vs 0.4, so we expect it to be 1.5x as erroneous. Weighting doesn’t add any instructions; it can be done in the pre-factor. Calling the coefficient a: a^0.5 = 1.5 a^-0.75, so a^1.25 = 1.5 and a = 1.38316186.
The final error is about .015%, or 2 orders of magnitude better than the initial `fastpow` result. The runtime is about a dozen cycles for a busy loop with `volatile` source and destination variables… although it’s overlapping the iterations, real-world usage will also see instruction-level parallelism. Considering SIMD, that’s a throughput of one scalar result per 3 cycles!
Well… sorry I wasn’t able to post this sooner. And extending it to x^(1/2.4) is left as an exercise ;v) .
Update with stats
I implemented a little test harness and two x^(5⁄12) cases corresponding to the above.
Output:
I suspect accuracy of the more accurate 5/12 is being limited by the `rsqrt` operation.