Using gcc 4.6 with -O3, I timed the following four programs with the simple time command:
#include <iostream>
int main(int argc, char* argv[])
{
    double val = 1.0;
    unsigned int numIterations = 1e7;
    for (unsigned int ii = 0; ii < numIterations; ++ii) {
        val *= 0.999;
    }
    std::cout << val << std::endl;
}
Case 1 runs in 0.09 seconds
#include <iostream>
int main(int argc, char* argv[])
{
    double val = 1.0;
    unsigned int numIterations = 1e8;
    for (unsigned int ii = 0; ii < numIterations; ++ii) {
        val *= 0.999;
    }
    std::cout << val << std::endl;
}
Case 2 runs in 17.6 seconds
int main(int argc, char* argv[])
{
    double val = 1.0;
    unsigned int numIterations = 1e8;
    for (unsigned int ii = 0; ii < numIterations; ++ii) {
        val *= 0.999;
    }
}
Case 3 runs in 0.8 seconds
#include <iostream>
int main(int argc, char* argv[])
{
    double val = 1.0;
    unsigned int numIterations = 1e8;
    for (unsigned int ii = 0; ii < numIterations; ++ii) {
        val *= 0.999999;
    }
    std::cout << val << std::endl;
}
Case 4 runs in 0.8 seconds
My question is: why is the second case so much slower than all the other cases? Case 3 shows that removing the cout brings the runtime back in line with what is expected, and Case 4 shows that changing the multiplier also greatly reduces the runtime. What optimization or optimizations are not occurring in Case 2, and why?
Update:
When I originally ran these tests there was no separate numIterations variable; the value was hard-coded in the for loop. In general, hard-coding the value made things run slower than in the cases given here. This is especially true for Case 3, which ran almost instantly with the numIterations variable as shown above, indicating that James McNellis is correct about the entire loop being optimized out. I'm not sure why hard-coding 1e8 into the for loop prevents the removal of the loop in Case 3 or makes the other cases slower. Either way, the basic premise holds even more strongly: Case 2 is significantly slower.
Diffing the assembly output for the cases above gives:
Case 2 and Case 1:
movl $100000000, 16(%esp)
movl $10000000, 16(%esp)
Case 2 and Case 4:
.long -652835029
.long 1072691150
.long -417264663
.long 1072693245
René Richter was on the right track regarding underflow. The smallest positive normalized double is about 2.2e-308. With f(n)=0.999**n, this limit is reached after about 708,000 iterations (n = ln(2.2e-308)/ln(0.999)). The remaining iterations are stuck with denormalized computations.
This explains why 100 million iterations take slightly more than 10 times the time needed to perform 10 million. The first 700,000 or so iterations are done using the floating point hardware. Once you hit denormalized numbers, the floating point hardware punts; the multiplication falls back to a much slower path, effectively done in software.
Note that this would not be the case if the repeated computation properly calculated 0.999**N. Eventually the product would reach zero, and from that point on the multiplications would once again be done with the floating point hardware. That is not what happens, because 0.999 * (smallest denormalized number) rounds back to the smallest denormalized number. The continued product eventually bottoms out at a tiny nonzero value instead of reaching zero.
What we can do is change the multiplier. A multiplier of 0.999999 will keep the continued product within the realm of normalized numbers for about 708 million iterations. Taking advantage of this,
Here you can easily see how much faster the floating point hardware is than is the software emulation.