I am working on a (quite large) existing monothreaded C application. In this context I modified the application to perform some very few additional work consisting in incrementing a counter each time we call a special function (this function is called ~ 80.000 times). The application is compiled on an Ubuntu 12.04 running a 64 bits Linux kernel 3.2.0-31-generic with -O3 option.
Surprisingly the instrumented version of the code is running faster and I am investigating why.I measure execution time with clock_gettime(CLOCK_PROCESS_CPUTIME_ID) and to get representative results, I am reporting an average execution time value over 100 runs. Moreover, to avoid interference from outside world, I tried as much as possible to launch the application in a system without any other applications running (on a side note, because CLOCK_PROCESS_CPUTIME_ID returns process time and not wall clock time, other applications “should” in theory only affect cache and not directly the process execution time)
I was suspecting “instruction cache effects”, maybe the instrumented code that is a little bit larger (few bytes) fits differently and better in the cache, is this hypothesis conceivable ? I tried to do some cache investigations with valegrind –tool=cachegrind but unfortunately, the instrumented version has (as it seems logical) more cache misses than the initial version.
Any hints on this subject and ideas that may help to find why instrumented code is running faster are welcomes (some GCC optimizations available in one case and not in the other, why ?, …)
Since there are not many details in the question, I can only recommend some factors to consider while investigating the problem.
Very few additional work (such as incrementing a counter) might alter compiler’s decision on whether to apply some optimizations or not. Compiler has not always enough information to make perfect choice. It may try to optimize for speed where bottleneck is code size. It may try to auto-vectorize computations when there is not too much data to process. Compiler may not know what kind of data is to be processed or what is the exact model of CPU, that will execute the code.
Even if this change does not affect compiler optimization, it may alter the way how the code is executed by CPU.
To investigate effects of compiler optimization, you can compare generated assembler code before and after addition of counter-incrementing code.
To investigate CPU effects, use a profiler allowing to inspect processor performance counters.