Using gcc 4.6 with -O3, I have timed the following four codes using the

Question

0

Asked: May 23, 20262026-05-23T14:46:09+00:00 2026-05-23T14:46:09+00:00

Using gcc 4.6 with -O3, I have timed the following four codes using the

0

Using gcc 4.6 with -O3, I have timed the following four codes using the simple time command

#include <iostream>
int main(int argc, char* argv[])
{
  double val = 1.0; 
  unsigned int numIterations = 1e7; 
  for(unsigned int ii = 0;ii < numIterations;++ii) {
    val *= 0.999;
  }
  std::cout<<val<<std::endl;
}

Case 1 runs in 0.09 seconds

#include <iostream>
int main(int argc, char* argv[])
{
  double val = 1.0;
  unsigned int numIterations = 1e8;
  for(unsigned int ii = 0;ii < numIterations;++ii) {
    val *= 0.999;
  }
  std::cout<<val<<std::endl;
}

Case 2 runs in 17.6 seconds

int main(int argc, char* argv[])
{
  double val = 1.0;
  unsigned int numIterations = 1e8;
  for(unsigned int ii = 0;ii < numIterations;++ii) {
    val *= 0.999;
  }
}

Case 3 runs in 0.8 seconds

#include <iostream>
int main(int argc, char* argv[])
{
  double val = 1.0;
  unsigned int numIterations = 1e8;
  for(unsigned int ii = 0;ii < numIterations;++ii) {
    val *= 0.999999;
  }
  std::cout<<val<<std::endl;
}

Case 4 runs in 0.8 seconds

My question is why is the second case so much slower than all the other cases? Case 3 shows that removing the cout brings the runtime back in line with what is expected. And Case 4 shows that changing the multiplier also greatly reduces the runtime. What optimization or optimizations are not occuring in case 2 and why?

Update:

When I originally ran these tests there was no separate variable numIterations, the value was hard-coded in the for loop. In general, hard-coding this value made things run slower than the cases given here. This is especially true for Case 3 which ran almost instantly with the numIterations variable as shown above, indicating James McNellis is correct about the entire loop being optimized out. I’m not sure why hard-coding the 1e8 into the for loop prevents the removal of the loop in Case 3 or makes things slower in the other cases, however, the basic premise of Case 2 being significantly slower is even more true.

Diffing the assembly output gives for the cases above gives

Case 2 and Case 1:
movl $100000000, 16(%esp)

movl $10000000, 16(%esp)

Case 2 and Case 4:
.long -652835029
.long 1072691150

.long -417264663
.long 1072693245

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-23T14:46:10+00:00

René Richter was on the right track regarding underflow. The smallest positive normalized number is about 2.2e-308. With f(n)=0.999**n, this limit is reached after about 708,148 iterations. The remaining iterations are stuck with unnormalized computations.

This explains why 100 million iterations take slightly more than 10 times the time needed to perform 10 million. The first 700,000 are done using the floating point hardware. Once you hit denormalized numbers, the floating point hardware punts; the multiplication is done in software.

Note that this would not be the case if the repeated computation properly calculated 0.999**N. Eventually the product would reach zero, and from that point on the multiplications would once again be done with the floating point hardware. That is not what happens because 0.999 * (smallest denormalized number) is the smallest denormalized number. The continued product eventually bottoms out.

What we can do is change the exponent. An exponent of 0.999999 will keep the continued product within the realm of normalized numbers for 708 million iterations. Taking advantage of this,

Case A  : #iterations = 1e7, exponent=0.999, time=0.573692 sec
Case B  : #iterations = 1e8, exponent=0.999, time=6.10548 sec
Case C  : #iterations = 1e7, exponent=0.999999, time=0.018867 sec
Case D  : #iterations = 1e8, exponent=0.999999, time=0.189375 sec

Here you can easily see how much faster the floating point hardware is than is the software emulation.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

Using gcc 4.6 with -O3, I have timed the following four codes using the

Update:

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply