The following three pieces of code achieves exactly the same effect. Yet, when compiled

Question

0

Asked: May 26, 20262026-05-26T01:19:12+00:00 2026-05-26T01:19:12+00:00

The following three pieces of code achieves exactly the same effect. Yet, when compiled

0

The following three pieces of code achieves exactly the same effect. Yet, when compiled with -O3 on GCC 4.5.2 the times for a lot of iterations vary quite markedly.

1 – Normal branching, using multiple conditions, best time 1.0:

// a, b, c, d are set to random values 0-255 before each iteration.
if (a < 16 or b < 32 or c < 64 or d < 128) result += a+b+c+d;

2 – Branching, manually using bitwise or to check conditions, best time 0.92:

if (a < 16 | b < 32 | c < 64 | d < 128) result += a+b+c+d;

3 – Finally, getting the same result without a branch, best time 0.85:

result += (a+b+c+d) * (a < 16 | b < 32 | c < 64 | d < 128);

The above times are the best for each method when run as the the inner loop of a benchmark program I made. The random() is seeded the same way before each run.

Before I made this benchmark I assumed GCC would optimize away the differences. Especially the 2nd example makes me scratch my head. Can anyone explain why GCC doesn’t turn code like this into equivalent faster code?

EDIT: Fixed some errors, and also made it clear that the random numbers are created regardless, and used, so as to not be optimized away. They always were in the original benchmark, I just botched the code I put on here.

Here is an example of an actual benchmark function:

boost::random::mt19937 rng;
boost::random::uniform_int_distribution<> ranchar(0, 255);

double quadruple_or(uint64_t runs) {
  uint64_t result = 0;
  rng.seed(0);

  boost::chrono::high_resolution_clock::time_point start = 
    boost::chrono::high_resolution_clock::now();
  for (; runs; runs--) {
    int a = ranchar(rng);
    int b = ranchar(rng);
    int c = ranchar(rng);
    int d = ranchar(rng);
    if (a < 16 or b < 32 or c < 64 or d < 128) result += a;
    if (d > 16 or c > 32 or b > 64 or a > 128) result += b;
    if (a < 96 or b < 53 or c < 199 or d < 177) result += c;
    if (d > 66 or c > 35 or b > 99 or a > 77) result += d;
  }

  // Force gcc to not optimize away result.
  std::cout << "Result check " << result << std::endl;
  boost::chrono::duration<double> sec = 
    boost::chrono::high_resolution_clock::now() - start;
  return sec.count();
}

The full benchmark can be found here.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-26T01:19:13+00:00

The OP has changed a bit since my original answer. Let me try to revisit here.

In case 1, because of or short-circuiting I would expect the compiler to generate four compare-then-branch code sections. Branches can obviously be pretty expensive especially if they don’t go the predicted path.

In case 2, the compiler can decide to do all four comparisons, convert them to bool 0/1 results, and then bitwise or all four pieces together, then doing a single (additional) branch. This trades possibly more comparisons for possibly fewer branches. It appears as if reducing the number of branches does improve performance.

In case 3, things work pretty much the same as 2 except at the very end one more branch may be eliminated by explicitly telling the compiler “I know the result will be zero or one, so just multiply the thing on the left by that value”. The multiply apparently comes out faster than the corresponding branch on your hardware. This is in contrast to the second example where the compiler doesn’t know the range of possible outputs from the bitwise or so it has to assume it could be any integer and must do a compare-and-jump instead.

Original answer for history:
The first case is functionally different from the second and third if random has side effects (which a normal PRNG would), so it stands to reason that the compiler may optimize them differently. Specifically, the first case will only call random as many times as needed to pass the check while in the other two cases random will always be called four times. This will (assuming random really is stateful) result in the future random numbers being different.

The difference between the second and third is because the compiler probably can’t figure out for some reason that the result of the bitwise or will always be 0 or 1. When you give it a hint to do the multiplication instead of branching the multiplication probably comes out faster due to pipelining.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

The following three pieces of code achieves exactly the same effect. Yet, when compiled

1 – Normal branching, using multiple conditions, best time 1.0:

2 – Branching, manually using bitwise or to check conditions, best time 0.92:

3 – Finally, getting the same result without a branch, best time 0.85:

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply