I’m trying to benchmark the difference between a function pointer call and a virtual function call. To do this, I have written two pieces of code that do the same mathematical computation over an array. One variant uses an array of pointers to functions and calls those in a loop. The other variant uses an array of pointers to a base class and calls its virtual function, which is overridden in the derived classes to do exactly the same thing as the functions in the first variant. Then I print the time elapsed and use a simple shell script to run the benchmark many times and compute the average run time.
Here is the code:
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <cmath>
using namespace std;
long long timespecDiff(struct timespec *timeA_p, struct timespec *timeB_p)
{
    // Multiply as long long so the seconds don't overflow when time_t is 32-bit.
    return ((timeA_p->tv_sec * 1000000000LL) + timeA_p->tv_nsec) -
           ((timeB_p->tv_sec * 1000000000LL) + timeB_p->tv_nsec);
}
void function_not( double *d ) {
    *d = sin(*d);
}

void function_and( double *d ) {
    *d = cos(*d);
}

void function_or( double *d ) {
    *d = tan(*d);
}

void function_xor( double *d ) {
    *d = sqrt(*d);
}

void ( * const function_table[4] )( double* ) = { &function_not, &function_and, &function_or, &function_xor };
int main(void)
{
    srand(time(0));

    void ( * index_array[100000] )( double * );
    double array[100000];
    for ( long int i = 0; i < 100000; ++i ) {
        index_array[i] = function_table[ rand() % 4 ];
        array[i] = ( double )( rand() / 1000 );
    }

    struct timespec start, end;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for ( long int i = 0; i < 100000; ++i ) {
        index_array[i]( &array[i] );
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);

    unsigned long long time_elapsed = timespecDiff(&end, &start);
    cout << time_elapsed / 1000000000.0 << endl;
}
And here is the virtual function variant:
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <cmath>
using namespace std;
long long timespecDiff(struct timespec *timeA_p, struct timespec *timeB_p)
{
    // Multiply as long long so the seconds don't overflow when time_t is 32-bit.
    return ((timeA_p->tv_sec * 1000000000LL) + timeA_p->tv_nsec) -
           ((timeB_p->tv_sec * 1000000000LL) + timeB_p->tv_nsec);
}
class A {
public:
    virtual void calculate( double *i ) = 0;
};

class A1 : public A {
public:
    void calculate( double *i ) {
        *i = sin(*i);
    }
};

class A2 : public A {
public:
    void calculate( double *i ) {
        *i = cos(*i);
    }
};

class A3 : public A {
public:
    void calculate( double *i ) {
        *i = tan(*i);
    }
};

class A4 : public A {
public:
    void calculate( double *i ) {
        *i = sqrt(*i);
    }
};
int main(void)
{
    srand(time(0));

    A *base[100000];
    double array[100000];
    for ( long int i = 0; i < 100000; ++i ) {
        array[i] = ( double )( rand() / 1000 );
        switch ( rand() % 4 ) {
        case 0:
            base[i] = new A1();
            break;
        case 1:
            base[i] = new A2();
            break;
        case 2:
            base[i] = new A3();
            break;
        case 3:
            base[i] = new A4();
            break;
        }
    }

    struct timespec start, end;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for ( long int i = 0; i < 100000; ++i ) {
        base[i]->calculate( &array[i] );
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);

    unsigned long long time_elapsed = timespecDiff(&end, &start);
    cout << time_elapsed / 1000000000.0 << endl;
}
My system is Linux, Fedora 13, with gcc 4.4.2. The code is compiled with g++ -O3. The first variant is test1, the second is test2.
Now I see this in the console:
[Ignat@localhost circuit_testing]$ ./test1 && ./test1
0.0153142
0.0153166
Well, more or less the same, I think. And then, this:
[Ignat@localhost circuit_testing]$ ./test2 && ./test2
0.01531
0.0152476
Where is the 25% difference that should be visible? How can the first executable be even slower than the second one?
I’m asking this because I’m doing a project which involves calling a lot of small functions in a row like this in order to compute the values of an array, and the code I’ve inherited does a very complex manipulation to avoid the virtual function call overhead. Now where is this famous call overhead?
I think you’re seeing the difference, but it’s only the function call overhead, which is small here. Branch misprediction, memory access and the trig functions cost the same in both cases, and compared to those the call mechanism is just not that big a deal, though the function pointer case was definitely a bit quicker when I tried it.
If this is representative of your larger program, this is a good demonstration that this kind of micro-optimization is sometimes just a drop in the ocean, and at worst futile. But leaving that aside, for a clearer test the functions should perform some simpler operation that is different for each function:
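A minimal sketch of the idea (the exact bodies are an assumption; anything cheap and distinct per function will do):

void function_not( double *d ) { *d += 1.0; }
void function_and( double *d ) { *d += 2.0; }
void function_or( double *d )  { *d += 3.0; }
void function_xor( double *d ) { *d += 4.0; }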
And so on, and similarly for the virtual functions.
(Each function should do something different, so that they don’t get folded together and all end up at the same address; identical addresses would make the branch prediction work unrealistically well.)
With these changes, the results are a bit different. Best of 4 runs in each case. (Not very scientific, but the numbers are broadly similar over larger numbers of runs.) All timings are in cycles, running on my laptop. The code was compiled with VC++ (I changed only the timing code), but gcc implements virtual function calls in the same way, so the relative timings should be broadly similar even with a different OS/x86 CPU/compiler.
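(The actual timing code isn’t shown here; one plausible sketch, assuming the compiler’s __rdtsc intrinsic — <intrin.h> on VC++, <x86intrin.h> on gcc — wrapped around the benchmark loop from the first test:)

#include <x86intrin.h>   // on VC++: #include <intrin.h>

// Read the CPU's time-stamp counter before and after the loop;
// the difference is the elapsed cycle count.
unsigned long long start_cycles = __rdtsc();
for ( long int i = 0; i < 100000; ++i ) {
    index_array[i]( &array[i] );
}
unsigned long long cycles = __rdtsc() - start_cycles;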
Function pointers: 2,052,770
Virtuals: 3,598,039
That difference seems a bit excessive! Sure enough, the two bits of code aren’t quite the same in terms of their memory access behaviour. The second one should have a table of four A *s, used to fill in base, rather than new’ing up a fresh object for each entry. Both examples will then have similar behaviour (1 cache miss per N entries) when fetching the pointer to jump through. For example:
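(A sketch of that change — object_table is an assumed name; one statically allocated object per derived class, shared by all entries:)

A1 a1; A2 a2; A3 a3; A4 a4;
A * const object_table[4] = { &a1, &a2, &a3, &a4 };

for ( long int i = 0; i < 100000; ++i ) {
    array[i] = ( double )( rand() / 1000 );
    base[i] = object_table[ rand() % 4 ];
}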
With this in place, still using the simplified functions:
Virtuals (as suggested here): 2,487,699
So there’s 20%, best case. Close enough?
So perhaps your colleague was right to at least consider this, but I suspect that in any realistic program the call overhead won’t be enough of a bottleneck to be worth jumping through hoops over.