I have just read a blogpost here and try to do a similar thing,

Asked: June 11, 2026

I just read a blog post here and tried to do something similar. Here is my code to check what is shown in examples 1 and 2:

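int doSomething(long numLoop, int cacheSize) {
    long k;
    int arr[1000000];   /* 1,000,000 ints, about 4 MB with 4-byte int */
    for (k = 0; k < numLoop; k++) {
        int i;
        /* touch every cacheSize-th element (cacheSize is the stride) */
        for (i = 0; i < 1000000; i += cacheSize) arr[i] = arr[i];
    }
    return 0;   /* the function is declared int, so return a value */
}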

According to the blog post, the execution times for doSomething(1000,2) and doSomething(1000,1) should be almost the same, but I got 2.1s and 4.3s respectively. Can anyone help explain why?
Thank you.

Update 1:
I have just made my array 100 times larger:

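#include <stdlib.h>   /* malloc, free */

int doSomething(long numLoop, int cacheSize) {
    long k;
    int *buffer;
    buffer = (int*) malloc(100000000 * sizeof(int));
    if (buffer == NULL) return -1;   /* allocation failed */
    for (k = 0; k < numLoop; k++) {
        int i;
        /* touch every cacheSize-th element (cacheSize is the stride) */
        for (i = 0; i < 100000000; i += cacheSize) buffer[i] = buffer[i];
    }
    free(buffer);
    return 0;
}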

Unfortunately, the execution times of doSomething(10,2) and doSomething(10,1) are still very different: 3.02s and 5.65s. Can anyone test this on their machine?


1 Answer

Editorial Team · Answer 1 · 2026-06-11T22:04:47+00:00

Your array size of 4 MB (1,000,000 ints of 4 bytes each) is not big enough. The entire array fits in the cache (and is in the cache after the first iteration of the k loop), so the timing is dominated by instruction execution rather than memory access. If you make arr much bigger than the cache size you will start to see the expected effect.

(You will see an additional effect as you grow arr past the cache size: runtime should increase linearly with the size of arr while it still fits in the cache. Once it no longer fits, you will see a knee in the curve where performance suddenly gets worse, after which runtime again increases linearly but with a steeper slope.)
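One way to look for that knee is to time the same stride-1 walk over arrays of increasing size and compare the cost per access. The following is only a minimal sketch: the function name seconds_per_access, the use of calloc, and the use of clock() are assumptions made for illustration and are not part of the original code.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: average CPU seconds per element access when an
   array of n ints is walked `passes` times with stride 1.
   n and passes must be positive. Returns -1.0 if allocation fails. */
double seconds_per_access(size_t n, long passes)
{
    /* volatile keeps the compiler from removing the self-assignment */
    volatile int *buf = calloc(n, sizeof *buf);
    if (buf == NULL)
        return -1.0;

    /* Warm-up pass so page faults are not counted in the timing */
    for (size_t i = 0; i < n; i++)
        buf[i] = 0;

    clock_t start = clock();
    for (long k = 0; k < passes; k++)
        for (size_t i = 0; i < n; i++)
            buf[i] = buf[i];
    clock_t end = clock();

    free((void *)buf);
    return (double)(end - start) / CLOCKS_PER_SEC
           / ((double)n * (double)passes);
}
```

Calling it for n = 1 << 10 up to 1 << 26, with passes scaled down as n grows so the total work stays similar, should show the per-access cost stepping up once the array no longer fits in a cache level. The exact sizes where that happens depend on the machine.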

Edit: I tried your second version with the following changes:

Change the declaration to volatile int *buffer so that buffer[i] = buffer[i] is not optimized away.
Compile with -O2 so that the loop is optimized enough to keep loop overhead from dominating.
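The source of the modified program is not shown, so the following is only a sketch of what the two changes might look like when applied to the second version. The name touch_with_stride, the returned counter, and the use of calloc instead of malloc (to avoid reading uninitialized memory) are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical rewrite: touch every stride-th int of an n-element heap
   buffer, `passes` times. Returns how many elements were touched in
   total, or 0 if the allocation fails. */
size_t touch_with_stride(size_t n, size_t stride, long passes)
{
    assert(stride > 0);   /* a zero stride would never terminate */

    /* volatile forces every load and store to actually happen,
       even when compiled with -O2 */
    volatile int *buffer = calloc(n, sizeof *buffer);
    if (buffer == NULL)
        return 0;

    size_t touched = 0;
    for (long k = 0; k < passes; k++) {
        for (size_t i = 0; i < n; i += stride) {
            buffer[i] = buffer[i];
            touched++;
        }
    }

    free((void *)buffer);
    return touched;
}
```

A small main that reads the stride from argv[1] and calls touch_with_stride(100000000, stride, 10) would reproduce the shape of the original experiment.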

When I try that I get almost identical times:

kronos /tmp $ time ./dos 2
./dos 2  1.65s user 0.29s system 99% cpu 1.947 total
kronos /tmp $ time ./dos 1
./dos 1  1.68s user 0.25s system 99% cpu 1.926 total

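Here you can see the effect of making the stride two full cache lines. Assuming 4-byte ints and 64-byte cache lines, a stride of 16 still touches every cache line, while a stride of 32 touches only every other one: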

kronos /tmp $ time ./dos 16
./dos 16  1.65s user 0.28s system 99% cpu 1.926 total
kronos /tmp $ time ./dos 32
./dos 32  1.06s user 0.30s system 99% cpu 1.356 total
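The timings above track the number of cache lines touched rather than the number of elements touched. A small sketch that counts the distinct lines a strided walk hits makes this concrete. The function name lines_touched is hypothetical, and the element and line sizes are passed as parameters because 4-byte ints and 64-byte lines are assumptions about the machine, as is a buffer that starts on a line boundary.

```c
#include <stddef.h>

/* Hypothetical helper: count the distinct cache lines hit when walking
   n elements of elem_bytes each with the given stride (stride > 0),
   assuming the buffer starts on a cache-line boundary. */
size_t lines_touched(size_t n, size_t stride,
                     size_t elem_bytes, size_t line_bytes)
{
    size_t count = 0;
    size_t last_line = (size_t)-1;   /* sentinel: no line seen yet */

    for (size_t i = 0; i < n; i += stride) {
        size_t line = (i * elem_bytes) / line_bytes;
        if (line != last_line) {     /* indices only increase, so a new
                                        line is always a later line */
            count++;
            last_line = line;
        }
    }
    return count;
}
```

With 1024 elements, 4-byte elements and 64-byte lines, strides of 1, 2 and 16 all touch 64 lines, while a stride of 32 touches 32. That matches the pattern in the measurements: similar times up to a stride of 16, then a drop at 32.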
