I have two files:
#include <stdio.h>
static inline void print0() { printf("Zero"); }
static inline void print1() { printf("One"); }
static inline void print2() { printf("Two"); }
static inline void print3() { printf("Three"); }
static inline void print4() { printf("Four"); }
int main()
{
    unsigned int input;
    scanf("%u", &input);
    switch (input)
    {
        case 0: print0(); break;
        case 1: print1(); break;
        case 2: print2(); break;
        case 3: print3(); break;
        case 4: print4(); break;
    }
    return 0;
}
and
#include <stdio.h>
static inline void print0() { printf("Zero"); }
static inline void print1() { printf("One"); }
static inline void print2() { printf("Two"); }
static inline void print3() { printf("Three"); }
static inline void print4() { printf("Four"); }
int main()
{
    unsigned int input;
    scanf("%u", &input);
    static void (*jt[])() = { print0, print1, print2, print3, print4 };
    jt[input]();
    return 0;
}
I expected them to be compiled to almost identical assembly code. In both cases jump tables are generated, but the calls in the first file are represented by jmp, while the calls in the second one are represented by call. Why doesn’t the compiler optimise the calls? Is it possible to hint gcc that I would like to see jmps instead of calls?
Compiled with gcc -Wall -Winline -O3 -S -masm=intel, GCC version 4.6.2. GCC 4.8.0 produces slightly less code, but the problem still persists.
UPD: Defining jt as const void (* const jt[])() = { print0, print1, print2, print3, print4 }; and making the functions static const inline didn’t help: http://ideone.com/97SU0
The first case (through the switch()) creates the following for me (Linux x86_64 / gcc 4.4):

Note that the .rodata contents @4006b8 are printed byte by byte in memory order, which on little-endian x86 makes each address look reversed. The value is 40058e, which is within main above, where the arg-initializer/jmp block (the mov/jmp pairs) starts. Each table entry is an eight-byte address, hence the (,%rax,8) scaled index. In this case, the sequence is therefore:

This means the compiler has actually optimized out the static call sites and instead merged them all into a single, inlined printf() call. The table use here is the jmp ...(,%rax,8) instruction, and the table points into the program code itself.

The second one (with the explicitly-created table) does the following for me:

Again, note the inverted byte order as objdump prints the data section. If you turn these around you get the function addresses for print[0-4]().

The compiler is invoking the target through an indirect call, i.e. the table usage is directly in the call instruction, and the table has explicitly been created as data.

Edit:
If you change the source like this:

the created assembly for main() becomes:

which looks more like what you wanted?
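The listing for the changed source is not reproduced here. As a rough sketch of the kind of change meant, assuming (from the remarks further down) that the index comes from a plain argument instead of a local filled in by scanf(): `dispatch` is a hypothetical name, and the handlers are hypothetical variants of print0..print4 that return printf's result, so the sketch also covers the case where the callee's value is passed straight through.

```c
#include <stdio.h>

/* Hypothetical variants of the question's print0..print4: each returns
   printf's result so the caller has a value to pass straight through. */
static int print0(void) { return printf("Zero"); }
static int print1(void) { return printf("One"); }
static int print2(void) { return printf("Two"); }
static int print3(void) { return printf("Three"); }
static int print4(void) { return printf("Four"); }

/* Explicit jump table, as in the question's second file. */
static int (*const jt[])(void) = { print0, print1, print2, print3, print4 };

/* No local variable has its address taken, and the indirect call is the
   last thing the function does, with its result returned unchanged.
   That leaves the compiler free (but not obliged) to emit a jmp
   instead of a call followed by ret. */
int dispatch(unsigned int idx)
{
    if (idx >= sizeof jt / sizeof jt[0])
        return -1;              /* out of range: nothing to call */
    return jt[idx]();
}
```

Calling `dispatch((unsigned int)argc)` from main and compiling with `gcc -O3 -S` lets you check how the indirect call is emitted; the result may differ between compiler versions and flags.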
The reason for this is that you need “stackless” funcs to be able to do this. A tail call (making the final call via jmp instead of call followed by ret) is only possible if you either have done all stack cleanup already, or don’t have to do any because you have nothing to clean up on the stack. The compiler can (but need not) choose to clean up before the last function call, in which case the last call can be made by jmp, but that’s only possible if you either return the value you got from that function or “return void”. And, as said, if you actually use the stack (like your example does for the input variable), nothing forces the compiler to undo this in a way that yields a tail call.

Edit2:

The disassembly for the first example, with the same changes (argc instead of input and forcing void main; no standard-conformance comments please, this is a demo), results in the following assembly:

This is worse in one way (it does two jmps instead of one) but better in another (because it eliminates the static functions and inlines the code). Optimization-wise, the compiler has pretty much done the same thing.