I want to speed up an array multiplication in C99.
These are the original for loops:
for (int i = 0; i < n; i++) {
    for (int j = 0; j < m; j++) {
        total[j] += w[j][i] * x[i];
    }
}
My boss asked me to try this, but it did not improve the speed:
for (int i = 0; i < n; i++) {
    float value = x[i];
    for (int j = 0; j < m; j++) {
        total[j] += w[j][i] * value;
    }
}
Do you have other ideas (besides OpenMP, which I already use) on how I could speed up these for loops?
I am using:
gcc -DMNIST=1 -O3 -fno-strict-aliasing -std=c99 -lm -D_GNU_SOURCE -Wall -pedantic -fopenmp
Thanks!
One theory is that testing for zero is faster than testing for j<m. So by looping from j=m while j>0, you could in theory save a few nanoseconds per iteration. However, in my recent experience this has made no difference at all, so I don't think it holds for current CPUs.

Another issue is memory layout: if your inner loop accesses a chunk of memory that is contiguous rather than spread out, chances are you will benefit more from the lowest-level cache in your CPU.
In your current example, switching the layout of w from w[j][i] to w[i][j] may therefore help. Aligning your values on 4- or 8-byte boundaries will help as well (but you will find that this is already the case for your arrays).

Another option is loop unrolling, meaning that you do your inner loop in chunks of, say, 4. The check for whether the loop is done then runs only a quarter as often. The optimum chunk size must be determined empirically, and may also depend on the problem at hand (e.g. if you know you're looping a multiple of 5 times, use 5).
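As a rough sketch of the layout change and the unrolling combined: this assumes the weights can be stored transposed in one flat array, so that element [i][j] sits at wt[i*m + j] and holds what was w[j][i]. The names wt and matvec_transposed are made up for illustration and are not from the question. The inner loop then walks contiguous memory and is unrolled by 4, with a short cleanup loop for the leftover elements.

```c
#include <stddef.h>

/* Hypothetical helper, not from the original post.
 * wt is the transposed weight matrix stored as n rows of m floats,
 * so wt[i * m + j] corresponds to the original w[j][i].
 * total must hold m floats and is accumulated into, as in the question. */
void matvec_transposed(size_t n, size_t m, const float *wt,
                       const float *x, float *total)
{
    for (size_t i = 0; i < n; i++) {
        const float value = x[i];
        const float *row = wt + i * m;  /* contiguous row for this i */
        size_t j = 0;

        /* Main part: inner loop unrolled by 4. */
        for (; j + 4 <= m; j += 4) {
            total[j]     += row[j]     * value;
            total[j + 1] += row[j + 1] * value;
            total[j + 2] += row[j + 2] * value;
            total[j + 3] += row[j + 3] * value;
        }

        /* Cleanup: remaining 0 to 3 elements when m is not a multiple of 4. */
        for (; j < m; j++) {
            total[j] += row[j] * value;
        }
    }
}
```

With -O3, gcc may already unroll or vectorize the simple loop once the access is contiguous, so the manual unrolling is worth keeping only if a measurement shows it helps.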