I was reading the metaprogramming part of the book ‘C++ Templates: The Complete Guide’. It contains an example of loop unrolling (17.7). I’ve implemented the program for dot product calculations:
    #include <iostream>
    #include <sys/time.h>
    using namespace std;

    template<int DIM, typename T>
    struct Functor
    {
        static T dot_product(T *a, T *b)
        {
            return *a * *b + Functor<DIM - 1, T>::dot_product(a + 1, b + 1);
        }
    };

    template<typename T>
    struct Functor<1, T>
    {
        static T dot_product(T *a, T *b)
        {
            return *a * *b;
        }
    };

    template<int DIM, typename T>
    T dot_product(T *a, T *b)
    {
        return Functor<DIM, T>::dot_product(a, b);
    }

    double dot_product(int DIM, double *a, double *b)
    {
        double res = 0;
        for (int i = 0; i < DIM; ++i)
        {
            res += a[i] * b[i];
        }
        return res;
    }

    int main(int argc, const char * argv[])
    {
        static const int DIM = 100;
        double a[DIM];
        double b[DIM];
        for (int i = 0; i < DIM; ++i)
        {
            a[i] = i;
            b[i] = i;
        }

        {
            timeval startTime;
            gettimeofday(&startTime, 0);
            for (int i = 0; i < 100000; ++i)
            {
                double res = dot_product<DIM>(a, b);
                //double res = dot_product(DIM, a, b);
            }
            timeval endTime;
            gettimeofday(&endTime, 0);
            double tS = startTime.tv_sec * 1000000 + startTime.tv_usec;
            double tE = endTime.tv_sec * 1000000 + endTime.tv_usec;
            cout << "template time: " << tE - tS << endl;
        }

        {
            timeval startTime;
            gettimeofday(&startTime, 0);
            for (int i = 0; i < 100000; ++i)
            {
                double res = dot_product(DIM, a, b);
            }
            timeval endTime;
            gettimeofday(&endTime, 0);
            double tS = startTime.tv_sec * 1000000 + startTime.tv_usec;
            double tE = endTime.tv_sec * 1000000 + endTime.tv_usec;
            cout << "loop time: " << tE - tS << endl;
        }
        return 0;
    }
I’m using Xcode with all code optimizations turned off. According to the book, I expected the template version to be faster than the simple loop. But the results are (t = template, l = loop; times in microseconds, as printed by the program):
DIM 5: t = ~5000, l = ~3500
DIM 50: t = ~55000, l = 16000
DIM 100: t = 130000, l = 36000
I’ve also tried making the template functions inline, with no difference in performance.
Why is the simple loop so much faster?
Depending on the compiler, if you don’t turn on performance optimizations, loop unrolling might not occur.
It’s pretty easy to understand why: your recursive template instantiations are basically creating a series of functions. With optimizations off, each of those instantiations stays a real function call with its own stack frame, so the template version pays call overhead for every element, while the loop only pays for an increment and a comparison. The compiler can’t turn all of that into an inlined, unrolled loop and still keep sensible debugging information available. Suppose a segfault happens somewhere inside one of your functions, or an exception is thrown. Wouldn’t you want to be able to get a stack trace that shows each frame? The compiler assumes you might want that, unless you turn on optimizations, which gives it permission to go to town on your code.
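To compare the two versions fairly once optimizations are on (for example `-O2` with Clang or GCC), the result has to be used somewhere, otherwise the optimizer is free to delete the whole computation. Below is a minimal sketch of that idea, not a rigorous benchmark. The names `UnrolledDot`, `loop_dot` and `time_us` are made up for illustration, and a compiler may still hoist loop-invariant work out of the timing loop, so treat the numbers as rough.

```cpp
#include <cassert>
#include <chrono>

// Recursive template: each instantiation multiplies one pair of elements
// and delegates the rest to the next instantiation.
template <int DIM, typename T>
struct UnrolledDot {
    static T eval(const T* a, const T* b) {
        return *a * *b + UnrolledDot<DIM - 1, T>::eval(a + 1, b + 1);
    }
};

// Base case that ends the recursion.
template <typename T>
struct UnrolledDot<1, T> {
    static T eval(const T* a, const T* b) { return *a * *b; }
};

// Plain runtime loop for comparison.
template <typename T>
T loop_dot(int dim, const T* a, const T* b) {
    T res = 0;
    for (int i = 0; i < dim; ++i) res += a[i] * b[i];
    return res;
}

// Hypothetical helper: calls f() `reps` times and returns the elapsed time
// in microseconds. Each result is accumulated into a volatile sink so the
// calls cannot be discarded as dead code.
template <typename F>
double time_us(F f, int reps, volatile double& sink) {
    assert(reps > 0);
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) sink = sink + f();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(end - start).count();
}
```

With arrays `a` and `b` of 100 doubles and a `volatile double sink = 0;`, the two timings would be taken as `time_us([&] { return UnrolledDot<100, double>::eval(a, b); }, 100000, sink)` and `time_us([&] { return loop_dot(100, a, b); }, 100000, sink)`. Running both in an unoptimized and an optimized build should make the effect described above visible.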