I have just read a blog post here and tried to do something similar. Here is my code to check what is shown in examples 1 and 2:
int doSomething(long numLoop, int cacheSize) {
    long k;
    int arr[1000000];
    for (k = 0; k < numLoop; k++) {
        int i;
        for (i = 0; i < 1000000; i += cacheSize) arr[i] = arr[i];
    }
    return 0;
}
According to the blog post, the execution times of doSomething(1000,2) and doSomething(1000,1) should be almost the same, but I got 2.1s and 4.3s respectively. Can anyone help explain why?
Thank you.
Update 1:
I have just made my array 100 times larger:
int doSomething(long numLoop, int cacheSize) {
    long k;
    int *buffer;
    buffer = (int*) malloc(100000000 * sizeof(int));
    for (k = 0; k < numLoop; k++) {
        int i;
        for (i = 0; i < 100000000; i += cacheSize) buffer[i] = buffer[i];
    }
    free(buffer);
    return 0;
}
Unfortunately, the execution times of doSomething(10,2) and doSomething(10,1) are still very different: 3.02s and 5.65s. Can anyone test this on their machine?
Your array size of 4 MB is not big enough. The entire array fits in the cache (and is in the cache after the first k loop), so the timing is dominated by instruction execution. If you make arr much bigger than the cache size, you will start to see the expected effect.
(You will see an additional effect when you make arr bigger than the cache: runtime should increase linearly with arr size until you exceed the cache, at which point you will see a knee in performance; it will suddenly get worse, and runtime will then increase on a new linear scale.)
Edit: I tried your second version with the following changes:
Declaring the buffer as volatile int *buffer, to ensure buffer[i] = buffer[i] is not optimized away.
Compiling with -O2, to ensure the loop is optimized sufficiently to prevent loop overhead from dominating.
When I try that I get almost identical times:
Here you can see the effects of making the stride two full cachelines:
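For anyone who wants to repeat the experiment, the modified test can be sketched roughly as follows. This is not the exact program behind the measurements mentioned above. The names time_stride and compare_strides are made up for illustration, and the strides of 16 and 32 ints assume a 4-byte int and 64-byte cache lines, which may not match your machine.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper (not from the original post): touches every
   `stride`-th int of an n-element buffer, `loops` times, and returns the
   elapsed CPU time in seconds. Returns -1.0 on a zero stride or if the
   allocation fails. */
double time_stride(size_t n, size_t stride, int loops)
{
    if (stride == 0)
        return -1.0;

    /* volatile keeps the compiler from deleting the self-assignment below */
    volatile int *buffer = malloc(n * sizeof *buffer);
    if (buffer == NULL)
        return -1.0;

    /* Write every element once so that first-touch page faults are not
       part of the measurement. */
    for (size_t i = 0; i < n; i++)
        buffer[i] = 0;

    clock_t start = clock();
    for (int k = 0; k < loops; k++)
        for (size_t i = 0; i < n; i += stride)
            buffer[i] = buffer[i];
    clock_t end = clock();

    free((void *)buffer);
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Prints one timing per stride. Strides 16 and 32 correspond to one and two
   cache lines only under the assumptions stated above. */
void compare_strides(size_t n, int loops)
{
    const size_t strides[] = {1, 2, 16, 32};
    for (size_t s = 0; s < sizeof strides / sizeof strides[0]; s++)
        printf("stride %2zu: %.2f s\n",
               strides[s], time_stride(n, strides[s], loops));
}
```

Calling compare_strides(100000000, 10) from main and building with -O2 corresponds to the second version of the question. The buffer is about 400 MB under the 4-byte int assumption, so shrink n if memory is tight, but keep it well above your cache size.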