Given the following code:
for (int i = 0; i < n; i++)
{
    counter += myArray[i];
}
And the loop-unrolled version:
// Note: as written, this assumes n is a multiple of 4.
for (int i = 0; i < n; i += 4)
{
    counter1 += myArray[i + 0];
    counter2 += myArray[i + 1];
    counter3 += myArray[i + 2];
    counter4 += myArray[i + 3];
}
total = counter1 + counter2 + counter3 + counter4;
- Why do we have a cache miss in the first version?
- Does the second version indeed perform better than the first? Why?
Regards
As Oli points out in the comments, this question is unfounded. If the data is already in the cache, then there will be no cache misses.
That aside, there is no difference in memory access between your two examples: both read the same elements of myArray in the same order. So memory access is unlikely to be a factor in any performance difference between them.
Usually, the thing to do is to actually measure. But in this particular example, I’d say that the second version will likely be faster. Not because of better cache access, but because of the loop unrolling.
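If you want to measure it yourself, here is a rough sketch of a timing harness, assuming a C++11 compiler with std::chrono. The names sum_plain and best_seconds are made up for illustration, and a serious benchmark would need more care (warm-up runs, larger inputs, checking the generated assembly).

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical baseline: the single-accumulator loop from the question.
long long sum_plain(const std::vector<int>& v)
{
    long long counter = 0;
    for (std::size_t i = 0; i < v.size(); i++)
        counter += v[i];
    return counter;
}

// Hypothetical timing helper: calls f(v) `repeats` times and returns the
// best (smallest) wall-clock time in seconds. The result is written through
// a volatile variable to discourage the optimizer from dropping the call.
template <typename F>
double best_seconds(F f, const std::vector<int>& v, int repeats, long long& out)
{
    typedef std::chrono::steady_clock clock;
    volatile long long sink = 0;
    double best = -1.0;
    for (int r = 0; r < repeats; r++)
    {
        clock::time_point start = clock::now();
        sink = f(v);
        clock::time_point stop = clock::now();
        double s = std::chrono::duration<double>(stop - start).count();
        if (best < 0.0 || s < best)
            best = s;
    }
    out = sink;
    return best;
}
```

Call best_seconds once with the plain loop and once with your unrolled version on the same input, check that both return the same sum, and compare the two times with optimizations enabled.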
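The optimization that you are doing is called “Node-Splitting”, where you separate the counter variable for the purpose of breaking the dependency chain. With a single counter, each addition has to wait for the previous one to finish; with four counters, the four additions in an iteration are independent of each other. However, in this case, you are doing a trivial reduction operation. Many modern compilers are able to recognize this pattern and do this node-splitting for you.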
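To make the dependency-chain point concrete, here is a minimal sketch of the split-counter loop that also handles an n that is not a multiple of 4. The function name sum_unrolled and the choice of int elements with a long long total are assumptions for illustration, not something taken from the question.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: sums n ints using four independent accumulators.
long long sum_unrolled(const int* data, std::size_t n)
{
    assert(data != nullptr || n == 0);

    long long c1 = 0, c2 = 0, c3 = 0, c4 = 0;
    std::size_t i = 0;

    // Main loop: the four additions do not depend on each other,
    // so the CPU is free to overlap them.
    for (; i + 4 <= n; i += 4)
    {
        c1 += data[i + 0];
        c2 += data[i + 1];
        c3 += data[i + 2];
        c4 += data[i + 3];
    }

    // Remainder loop: picks up the last 0 to 3 elements.
    for (; i < n; i++)
        c1 += data[i];

    return c1 + c2 + c3 + c4;
}
```

For integers the result is identical to the single-counter loop. For floating-point data the split changes the order of the additions and therefore the rounding, which is one reason compilers generally will not do this transformation on floats unless you allow relaxed floating-point math.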
So is it faster? Most likely. But you should check to see if the compiler does it for you.
For the record: I just tested this on Visual Studio 2010, and I am quite surprised that it does not seem to be capable of performing “Node Splitting” for this (trivial) example.