I am working on a (quite large) existing monothreaded C application. In this context

Question


Asked: June 12, 2026 (2026-06-12T07:28:08+00:00)


I am working on a (quite large) existing single-threaded C application. In this context, I modified the application to do a small amount of additional work: incrementing a counter each time a particular function is called (this function is called about 80,000 times). The application is compiled with the -O3 option on Ubuntu 12.04 running a 64-bit Linux kernel, 3.2.0-31-generic.

Surprisingly, the instrumented version of the code runs faster, and I am investigating why. I measure execution time with clock_gettime(CLOCK_PROCESS_CPUTIME_ID), and to get representative results I report the average execution time over 100 runs. Moreover, to avoid outside interference, I tried as far as possible to launch the application on a system with no other applications running. (As a side note, because CLOCK_PROCESS_CPUTIME_ID returns process CPU time and not wall-clock time, other applications “should” in theory only affect the cache and not directly the measured execution time.)
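For reference, the measurement described above might look roughly like the following sketch. It times repeated calls inside one process, whereas the question may be averaging over 100 separate launches of the application. The names `sample_workload`, `cpu_seconds_now`, and `average_cpu_seconds` are hypothetical and stand in for the real code. The minimum is reported next to the mean because it is usually less sensitive to occasional disturbances.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* exposes clock_gettime under strict -std modes */
#endif
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for the code under test. */
static void sample_workload(void)
{
    volatile unsigned long sink = 0;   /* volatile keeps -O3 from deleting the loop */
    for (unsigned long i = 0; i < 1000000UL; i++)
        sink += i;
}

/* CPU time consumed by this process so far, in seconds. */
static double cpu_seconds_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Runs fn `runs` times; returns the mean CPU time per run and,
 * if min_out is not NULL, stores the fastest single run there. */
static double average_cpu_seconds(void (*fn)(void), int runs, double *min_out)
{
    assert(fn != NULL && runs > 0);
    double total = 0.0;
    double best = -1.0;
    for (int i = 0; i < runs; i++) {
        double start = cpu_seconds_now();
        fn();
        double elapsed = cpu_seconds_now() - start;
        total += elapsed;
        if (best < 0.0 || elapsed < best)
            best = elapsed;
    }
    if (min_out != NULL)
        *min_out = best;
    return total / runs;
}
```

Calling `average_cpu_seconds(sample_workload, 100, &best)` from `main` gives both figures. On older glibc versions, such as the one shipped with Ubuntu 12.04, linking may additionally need `-lrt`.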

I suspected “instruction cache effects”: maybe the instrumented code, which is slightly larger (a few bytes), fits differently and better in the cache. Is this hypothesis plausible? I tried some cache investigation with valgrind --tool=cachegrind, but unfortunately the instrumented version has (as seems logical) more cache misses than the original version.

Any hints on this subject, and any ideas that may help find out why the instrumented code runs faster, are welcome (for example, some GCC optimization that is applied in one case and not in the other, and why).


1 Answer

Editorial Team · Answer 1 · 2026-06-12T07:28:09+00:00

Since there are not many details in the question, I can only recommend some factors to consider while investigating the problem.

Even a very small amount of additional work (such as incrementing a counter) might alter the compiler’s decision on whether or not to apply some optimizations. The compiler does not always have enough information to make a perfect choice. It may optimize for speed where the bottleneck is code size. It may auto-vectorize computations when there is too little data to benefit from it. The compiler may not know what kind of data is to be processed or the exact model of CPU that will execute the code.

Incrementing a counter may increase the size of some loop and prevent loop unrolling. This may decrease code size (and improve code locality, which is good for the instruction cache, the micro-op cache, or the loop buffer, and lets the CPU fetch and decode instructions quickly).
Incrementing a counter may increase the size of some function and prevent inlining. This may also decrease code size.
Incrementing a counter may prevent auto-vectorization, which again may decrease code size.

Even if this change does not affect compiler optimization, it may alter the way the code is executed by the CPU.

If you insert counter-incrementing code in a place full of branch targets, this may make the branch targets less dense and improve branch prediction.
If you insert counter-incrementing code in front of some particular branch target, this may make that branch target’s address better aligned and make instruction fetch faster.
If you place counter-incrementing code after some data is written but before the same data is loaded again (and store-to-load forwarding did not work for some reason), the load operation may complete earlier.
Inserting counter-incrementing code may prevent two conflicting loads from hitting the same bank in the L1 data cache.
Inserting counter-incrementing code may alter some CPU scheduler decision and make an execution port available just in time for some performance-critical instruction.

To investigate the effects of compiler optimization, you can compare the generated assembly code before and after the addition of the counter-incrementing code.
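One way to set up such a comparison is to keep the counter behind a macro, so that both variants are built from the same source file. This is only a sketch: `INSTRUMENT`, `COUNT_CALL`, `call_count`, `sum_array`, and the file name `example.c` are hypothetical, and `sum_array` stands in for the real function, whose code the question does not show.

```c
#include <assert.h>
#include <stddef.h>

#ifdef INSTRUMENT
unsigned long call_count = 0;            /* hypothetical call counter */
#define COUNT_CALL() ((void)call_count++)
#else
#define COUNT_CALL() ((void)0)           /* compiles to nothing */
#endif

/* Hypothetical stand-in for the instrumented "special function". */
long sum_array(const int *data, size_t n)
{
    assert(data != NULL || n == 0);
    COUNT_CALL();
    long total = 0;
    for (size_t i = 0; i < n; i++)       /* candidate for unrolling/vectorization */
        total += data[i];
    return total;
}

/*
 * Generate assembly for both variants and compare them:
 *   gcc -O3 -S -o plain.s example.c
 *   gcc -O3 -S -DINSTRUMENT -o counted.s example.c
 *   diff plain.s counted.s
 */
```

Differences beyond the increment itself, such as a loop that is unrolled in one file but not in the other, or a call that is inlined in only one of them, point to a change in the compiler’s decisions rather than a pure CPU effect.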

To investigate CPU effects, use a profiler that can read the processor’s performance counters.
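A profiler is the usual route. As an alternative, on Linux the same counters can also be read from inside the program through the `perf_event_open` system call, which may help when only one region of code is of interest. The following is a minimal sketch, not a replacement for a profiler: `sample_workload`, `open_counter`, and `count_events` are hypothetical names, and the call can fail (returning -1 here) if the kernel, its settings, or the hardware does not expose the counter.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE               /* for syscall() under strict -std modes */
#endif
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for the code region being examined. */
static void sample_workload(void)
{
    volatile unsigned long sink = 0;
    for (unsigned long i = 0; i < 1000000UL; i++)
        sink += i;
}

/* Opens one hardware counter (e.g. PERF_COUNT_HW_BRANCH_MISSES) for the
 * calling process on any CPU. Returns a file descriptor, or -1 on failure. */
static int open_counter(uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = config;
    attr.disabled = 1;            /* start stopped; enabled explicitly below */
    attr.exclude_kernel = 1;      /* count user-space events only */
    attr.exclude_hv = 1;
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

/* Counts events of the given kind while fn runs.
 * Returns 0 and fills *out on success, -1 if the counter is unavailable. */
static int count_events(uint64_t config, void (*fn)(void), uint64_t *out)
{
    assert(fn != NULL && out != NULL);
    int fd = open_counter(config);
    if (fd == -1)
        return -1;
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    ssize_t got = read(fd, out, sizeof *out);
    close(fd);
    return got == (ssize_t)sizeof *out ? 0 : -1;
}
```

Calling `count_events` with `PERF_COUNT_HW_BRANCH_MISSES`, `PERF_COUNT_HW_CACHE_MISSES`, or `PERF_COUNT_HW_INSTRUCTIONS` on both builds gives per-region numbers that can be set against the branch-prediction, cache, and fetch hypotheses listed above.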
