I’m running the following benchmark:
int main(int argc, char **argv)
{
char *d = malloc(sizeof(char) * 13);
TIME_THIS(func_a(999, d), 99999999);
TIME_THIS(func_b(999, d), 99999999);
return 0;
}
With normal compilation, the results are the same for both functions:
% gcc func_overhead.c func_overhead_plus.c -o func_overhead && ./func_overhead
[func_a(999, d) ] 9276227.73
[func_b(999, d) ] 9265085.90
But with -O3 they are very different:
% gcc -O3 func_overhead.c func_overhead_plus.c -o func_overhead && ./func_overhead
[func_a(999, d) ] 178580674.69
[func_b(999, d) ] 48450175.29
func_a and func_b are defined like this:
char *func_a(uint64_t id, char *d)
{
register size_t i, j;
register char c;
for (i = 0, j = 36; i <= 11; i++)
if (i == 4 || i == 8)
d[i] = '/';
else {
c = ((id >> j) & 0xf) + '0';
if (c > '9')
c = c - '9' - 1 + 'A';
d[i] = c;
j -= 4;
}
d[12] = '\0';
return d;
}
The only difference is that func_a is in the same file as main(), while func_b is in func_overhead_plus.c.
I’m wondering if anyone could elaborate on what’s going on.
Thanks
Edit:
Sorry about all the confusion regarding the results. They are actually calls per second (higher is better), so func_a() is faster than func_b() with -O3.
TIME_THIS is defined like so:
double get_time(void)
{
struct timeval t;
gettimeofday(&t, NULL);
return t.tv_sec + t.tv_usec*1e-6;
}
#define TIME_THIS(func, runs) do { \
double t0, td; \
int i; \
t0 = get_time(); \
for (i = 0; i < runs; i++) \
func; \
td = get_time() - t0; \
printf("[%-35s] %15.2f\n", #func, runs / td); \
} while(0)
The platform is 32-bit x86 (i686) Linux:
Linux komiko 2.6.30-gentoo-r2 #1 SMP PREEMPT Wed Jul 15 17:27:51 IDT 2009 i686 Intel(R) Core(TM)2 Quad CPU Q8200 @ 2.33GHz GenuineIntel GNU/Linux
gcc is 4.3.3
As suggested, here are the results of mixing the calls a little:
-O3
[func_b(999, d) ] 48926120.09
[func_a(999, d) ] 135299870.52
[func_b(999, d) ] 49075900.30
[func_a(999, d) ] 135748939.12
[func_b(999, d) ] 49039535.67
[func_a(999, d) ] 134055084.58
-O2
[func_b(999, d) ] 27243732.97
[func_a(999, d) ] 27341371.38
[func_b(999, d) ] 27303284.93
[func_a(999, d) ] 27349177.65
[func_b(999, d) ] 27325398.25
[func_a(999, d) ] 27343935.88
(-O1 and -Os were same as -O2 in this test)
no optimizations
[func_b(999, d) ] 8852314.88
[func_a(999, d) ] 9646166.81
[func_b(999, d) ] 8909973.33
[func_a(999, d) ] 9734883.99
[func_b(999, d) ] 8726127.49
[func_a(999, d) ] 9566052.21
It looks like the unoptimized build behaves like -O3 in that func_a seems to be faster than func_b.
Just for fun, compiling with gcc 4.4.4 gives some interesting results:
no optimizations
[func_b(999, d) ] 16982343.03
[func_a(999, d) ] 19693688.36
[func_b(999, d) ] 17260359.40
[func_a(999, d) ] 18137352.08
[func_b(999, d) ] 16790465.45
[func_a(999, d) ] 19828836.94
-O3
[func_b(999, d) ] 52184739.72
[func_a(999, d) ] 99999237556468.61
[func_b(999, d) ] 52430823.56
[func_a(999, d) ] 101030101.92
[func_b(999, d) ] 52404446.52
[func_a(999, d) ] 100842538.40
this is pretty weird, isn’t it?
Edit:
If the performance difference is indeed due to gcc 4.3/4.4 being unable to inline across object files, should it be considered good practice to include performance-critical functions in the same file?
e.g.
#include "performance_critical.c"
Or is it just messy and most likely not really significant?
Thanks
Whenever you’re curious about what’s going on under the optimization hood, check out the -S option. This will let you examine the assembly output to see exactly what’s different between the two versions.
When a compiler is working within a single file (read: translation unit), it has access to all the types, objects, and function bodies defined in it (after preprocessing). When a function lives in a separate file, the compiler sees only its declaration at the call site, not its body. The linker, which puts the two object files together, sees only symbol names and machine code.
In your case, the compiler is likely figuring out how the pointers are used and realizing it can inline the function call when both functions are in the same file. When the function is in a second file, it MUST emit a real call through the normal calling convention, so you get the added function call overhead.
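To make the inlining point concrete, here is a minimal single-file sketch, assuming GCC or Clang. The names format_id_visible, format_id_opaque and both_agree are hypothetical and not from the question. `__attribute__((noinline))` is used only to imitate a function whose body the compiler cannot see at the call site, which is roughly the situation func_b is in; it is not what actually happens in your two-file build.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same formatting logic as func_a/func_b in the question.
   The body is visible in this translation unit, so at -O3 the
   optimizer is free to inline it into its callers. */
static inline char *format_id_visible(uint64_t id, char *d)
{
    size_t i;
    unsigned j = 36;

    assert(d != NULL);
    for (i = 0; i <= 11; i++) {
        if (i == 4 || i == 8) {
            d[i] = '/';
        } else {
            char c = (char)(((id >> j) & 0xf) + '0');
            if (c > '9')
                c = (char)(c - '9' - 1 + 'A');
            d[i] = c;
            j -= 4;
        }
    }
    d[12] = '\0';
    return d;
}

/* Hypothetical stand-in for a function defined in another .c file.
   noinline (a GCC/Clang extension) forces every caller to emit a
   real call, as it would have to for an external symbol. */
__attribute__((noinline))
char *format_id_opaque(uint64_t id, char *d)
{
    return format_id_visible(id, d);
}

/* Returns 1 when both variants produce the same string, so any
   timing difference comes from the call, not from the work done. */
int both_agree(uint64_t id)
{
    char a[13], b[13];

    format_id_visible(id, a);
    format_id_opaque(id, b);
    return strcmp(a, b) == 0;
}
```

Timing format_id_visible against format_id_opaque with your TIME_THIS macro, and comparing the `gcc -O3 -S` output for the two loops, should show whether the call is still present in each case.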
Edit
torak pointed out that I interpreted this backwards. I don’t know why the single-file version would perform more slowly…