I recently wrote a Vector 3 class, and I submitted my normalize() function to a friend for review. He said it was good, but that I should multiply by the reciprocal where possible because “multiplying is cheaper than dividing” in CPU time.
My question simply is, why is that?
Think about it in terms of the elementary operations that hardware can implement most easily: add, subtract, shift, compare. Even in a trivial setup, multiplication requires fewer such elementary steps. It also affords advanced algorithms that are faster still (see Wikipedia, for example), though hardware generally doesn’t take advantage of those (except maybe extremely specialized hardware). For example, as the Wikipedia page notes, three-way Toom–Cook can do a size-3N multiplication for the cost of five size-N multiplications, which is pretty fast indeed for very large numbers. (Fürer’s algorithm, a pretty recent development, runs in $\Theta(n \ln(n)\, 2^{\Theta(\ln^*(n))})$; again, see the Wikipedia page and links therefrom.)
Division is just intrinsically slower, again per Wikipedia: even the best division algorithms (some of which ARE implemented in hardware, precisely because they’re nowhere near as sophisticated and complex as the very best algorithms for multiplication ;-) can’t hold a candle to the multiplication ones.
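
To make the "elementary operations" framing concrete, here is a minimal sketch of schoolbook shift-and-add multiplication and restoring division on non-negative integers. The function names are my own, and this is only an illustration of the idea, not a description of how any particular CPU implements either operation. One thing it does show: every partial product in the multiplier depends only on the inputs, so the additions could in principle be done in any order or summed in parallel, whereas each quotient bit in the divider depends on the remainder left by the previous step.

```python
def shift_add_multiply(a, b):
    """Multiply two non-negative integers using only shift, add, and bit tests."""
    result = 0
    while b:
        if b & 1:      # lowest bit of b is set: add the current shifted copy of a
            result += a
        a <<= 1        # the next partial product is a shifted one place left
        b >>= 1        # move on to the next bit of b
    return result


def restoring_divide(n, d):
    """Divide non-negative n by positive d using shift, subtract, and compare.

    Returns the pair (quotient, remainder).
    """
    if d <= 0:
        raise ValueError("divisor must be positive")
    quotient, remainder = 0, 0
    for i in range(n.bit_length() - 1, -1, -1):
        remainder = (remainder << 1) | ((n >> i) & 1)  # bring down the next bit of n
        quotient <<= 1
        if remainder >= d:   # this compare needs the remainder from the previous step
            remainder -= d
            quotient |= 1
    return quotient, remainder


print(shift_add_multiply(13, 11))
print(restoring_divide(100, 7))
```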
Just to quantify the issue with not-so-huge numbers, here are some results with gmpy, an easy-to-use Python wrapper around GMP, which tends to have pretty good implementations of arithmetic though not necessarily the latest-and-greatest wheezes. On a slow (first-generation ;-) MacBook Pro:
As you see, even at this small size (number of bits in the numbers), and with libraries optimized by exactly the same speed-obsessed people, multiplication by the reciprocal can save 1/3 of the time that division takes.
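
If you want to try this kind of comparison yourself, here is a rough sketch using only the standard library's `timeit` and plain Python floats rather than gmpy. The helper names, the list size, and the divisor are all my own choices for illustration. On ordinary floats in CPython the interpreter overhead may well swamp any difference between the two operations, so treat the timings as something to measure on your own machine, not as a prediction.

```python
import timeit


def divide_all(values, b):
    """Divide every value by b, one division per element."""
    return [v / b for v in values]


def multiply_by_reciprocal(values, b):
    """Compute 1.0/b once, then use one multiplication per element."""
    inv = 1.0 / b
    return [v * inv for v in values]


values = [float(i) for i in range(1, 1001)]  # arbitrary test data
b = 7.0                                      # arbitrary divisor

t_div = timeit.timeit(lambda: divide_all(values, b), number=200)
t_mul = timeit.timeit(lambda: multiply_by_reciprocal(values, b), number=200)

print(f"division:              {t_div:.4f} s")
print(f"multiply by reciprocal: {t_mul:.4f} s")
```

Note that `v * (1.0 / b)` rounds twice, so its result can differ from `v / b` in the last bit or so; that is usually fine for something like normalizing a vector, but worth knowing before applying the trick everywhere.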
It may be only in rare situations that these few nanoseconds are a life-or-death issue, but when they are, and of course IF you are repeatedly dividing by the same value (to amortize away the `1.0/b` operation!), this knowledge can be a life-saver.
(Much in the same vein: `x*x` will often save time compared to `x**2` [in languages that have a `**` “raise to power” operator, like Python and Fortran], and Horner’s scheme for polynomial computation is VASTLY preferable to repeated raise-to-power operations!-).
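
For readers who haven't met Horner's scheme, here is a small sketch contrasting it with the term-by-term approach. The function names and the coefficient convention (lowest-order coefficient first) are assumptions of mine for illustration. Horner's scheme rewrites the polynomial as nested multiplications, so it needs just one multiply and one add per coefficient and no power operations at all.

```python
def poly_naive(coeffs, x):
    """Evaluate sum(coeffs[i] * x**i), using an explicit power for every term."""
    return sum(c * x**i for i, c in enumerate(coeffs))


def poly_horner(coeffs, x):
    """Evaluate the same polynomial with Horner's scheme.

    coeffs[i] is the coefficient of x**i; we start from the highest-order
    coefficient and repeatedly multiply by x and add the next one.
    """
    result = 0
    for c in reversed(coeffs):
        result = result * x + c
    return result


coeffs = [1, -3, 0, 2]         # represents 1 - 3x + 2x^3
print(poly_naive(coeffs, 2))   # 1 - 6 + 16 = 11
print(poly_horner(coeffs, 2))  # ((2*2 + 0)*2 - 3)*2 + 1 = 11
```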