I’ve got libraries that do not expose any function except, say, “CreateObject”. Nevertheless, all their functions are called indirectly, and I see in the perf report that up to 1.65% of the time is spent in __i686.get_pc_thunk.bx. The functions (class methods) are called 160 million times and they are internal to the shared library, i.e. not exposed.
I wonder if it’s possible to compile internal methods without relocations – i.e. using relative offsets or something like that.
gcc is 4.5.2
UPDATE: Actually I think that was because of -O0 left in the makefile. So it’s not a big deal now, but I would still like to do the same with -O0, too, as it leaves less “garbage” for the profiler. I wonder which “real” option within -O2 does this.
UPDATE2: hm, it wasn’t -O2, it was probably --dynamic-list that lowered the pc_thunk performance hit a bit, but it’s still there… so I’m not even sure that --dynamic-list really helps. Should hidden symbols still include indirection thunks? Is that correct?
UPDATE3: I created a test project. For the internal library function I set the attribute visibility to hidden, I compile with gcc 4.7 with -O2 and LTO enabled, and I pass --dynamic-list to the linker without the internal function in the list. Nevertheless, the call to get_pc_thunk is still there.
This is the code in the test shared library:
#include <stdio.h>

__attribute__((visibility("hidden"), noinline)) void lib1f2()
{
    puts("I should have PLT disabled");
}

void lib1f()
{
    puts("I'm lib1");
    lib1f2();
}
In gdb I still see the thunk call inside lib1f2.
What’s funny is that with -fwhole-program lib1f2 is inlined into the main executable but still contains this call to the thunk.
UPDATE4: OK, I’m getting close (to realizing I was being dumb): the program (and the code above) uses data, even if it’s just a const string, so it needs GOT accesses. So the question now is:
- Still, can I avoid thunks for GOT?
- (related) maybe by compiling without -fPIC? What would the drawbacks be?
I think not. Not on i686 at least. The point is that code can automagically do relative jumps… or rather, all jumps and calls on x86 are relative, besides indirect ones IIRC. On the other hand, on i686 there is no way to address data relative to the current program counter, so position-independent code must first load the program counter into a register, and that is exactly what the thunk does. This problem is actually solved in x86_64, since there is a new instruction-pointer-relative addressing mode that can be used for exactly these cases.
Your test, compiled with gcc -fPIC -shared -O2 -flto
On 32 bit:
On 64 bit:
Well, although it’s embarrassing, I have to admit I’m slightly confused here. At first sight I would have said that a shared library must be compiled with -fPIC. Instead, the following two commands both work
In the non -fPIC case the code also does not need any call to get_pc_thunk. The trick is that the dynamic loader patches the library code at load time with the right addresses of the data.
This is a problem though, since you gained some speed avoiding the thunks, but you lost the ability to actually share the shared library since the operating system must create a new copy for every code page of the library which contains a relocation. On the other hand, when a GOT is used only the GOT page(s) must be duplicated, greatly reducing the memory footprint when many applications link to the same library.
Interestingly enough, in 64-bit mode it is not possible to compile a library in non-PIC mode; the following command fails
Still, since there is processor-provided support for code-relative addressing, this is not a problem.