I have two logically equivalent functions:

    long ipow1(int base, int exp) {
        // HISTORICAL NOTE:
        // This wasn't here in the original question, I edited it in.
        if (exp == 0) return 1;

        long result = 1;
        while (exp > 1) {
            if (exp & 1) result *= base;
            exp >>= 1;
            base *= base;
        }
        return result * base;
    }

    long ipow2(int base, int exp) {
        long result = 1;
        while (exp) {
            if (exp & 1) result *= base;
            exp >>= 1;
            base *= base;
        }
        return result;
    }

NOTICE:
These loops are equivalent: ipow1 stops once exp has been reduced to 1 and returns result * base to account for that last bit (the exp == 0 case is handled by the early return), whereas ipow2 keeps looping until exp reaches 0 and returns result.
Strangely enough, with both -O3 and -O0, ipow1 consistently outperforms ipow2 by about 25%. How is this possible?
I’m on Windows 7, x64, gcc 4.5.2 and compiling with gcc ipow.c -O0 -std=c99.
And this is my profiling code:

    #include <stdio.h>
    #include <windows.h>

    int main(int argc, char *argv[]) {
        LARGE_INTEGER ticksPerSecond;
        LARGE_INTEGER tick;
        LARGE_INTEGER start_ticks, end_ticks, cputime;

        double totaltime = 0;
        int repetitions = 10000;
        int rep = 0;
        int nopti = 0;

        for (rep = 0; rep < repetitions; rep++) {
            if (!QueryPerformanceFrequency(&ticksPerSecond)) printf("\tno go QueryPerformance not present");
            if (!QueryPerformanceCounter(&tick)) printf("no go counter not installed");
            QueryPerformanceCounter(&start_ticks);

            /* start real code */
            for (int i = 0; i < 55; i++) {
                for (int j = 0; j < 11; j++) {
                    nopti = ipow1(i, j); // or ipow2
                }
            }
            /* end code */

            QueryPerformanceCounter(&end_ticks);
            cputime.QuadPart = end_ticks.QuadPart - start_ticks.QuadPart;
            totaltime += (double)cputime.QuadPart / (double)ticksPerSecond.QuadPart;
        }

        /* nopti is an int, so it is printed with %d */
        printf("\tTotal elapsed CPU time: %.9f sec with %d repetitions - %d:\n", totaltime, repetitions, nopti);
        return 0;
    }

If you don't want to read all of this, skip to the bottom: I come up with a 21% difference just by analysis of the code.
Different systems, different versions of the compiler, and even the same compiler version built by different folks/distros will give different instruction mixes; this is just one example of what you might get.
Isolating the loops:
Comparing something to zero normally saves you an instruction, and you can see that here.
Your timing method is going to generate a lot of error/chaos. Depending on the beat frequency of the loop and the accuracy of the timer, you can create a lot of gain in one measurement and a lot of loss in another. Reading the timer once before and once after all the repetitions normally gives better accuracy:

    starttime = …
    for(rep=bignumber;rep;rep--)
    {
        //code under test
        …
    }
    endtime = …
    total = endtime - starttime;

Of course, if you are running this on an operating system, the timing is going to have a decent amount of error in it anyway.
Also, you want to use volatile variables for your timer variables; that helps keep the compiler from rearranging the order of execution (been there, seen that).
If we look at this from the perspective of the base multiplies, there are 50% more for ipow2(). Actually, it is not just the multiplies; it is that you are going through the loop 50% more times.
ipow1() gets a little back on the other multiplies:
ipow1() performs the result *= base multiply a different number of (more) times than ipow2().
Being a long * int can make these more expensive, but not enough to make up for the losses around the loop in ipow2().
Even without disassembling, you can make a rough guess at the operations/instructions you hope the compiler uses. This accounts for processors in general, not necessarily x86; some processors will run this code better than others (from a number-of-instructions-executed perspective, not counting all the other factors).
Assuming I counted all the major operations and didn't unfairly give one function more than another:
ipow2 is 21% slower using this analysis.
I think the big killer is the 50% more times through the loop. Granted, it is data dependent; you might find inputs in a benchmark test that make the difference between the functions larger or smaller than the 25% you are seeing.