Some time ago I wrote a little piece of code to ask about on

Question

0

Asked: May 11, 20262026-05-11T01:57:30+00:00 2026-05-11T01:57:30+00:00

Some time ago I wrote a little piece of code to ask about on

0

Some time ago I wrote a little piece of code to ask about on interviews and see how people understand concepts of cache and memory:

#include 'stdafx.h' #include <stdlib.h> #include <windows.h> #include <iostream>  #define TOTAL 0x20000000  using namespace std;  __int64 count(int INNER, int OUTER) {     int a = 0;     int* arr = (int*) HeapAlloc(GetProcessHeap(), 0, INNER * sizeof(int));     if (!arr) {         cerr << 'HeapAlloc failed\n';         return 1;     }     LARGE_INTEGER freq;     LARGE_INTEGER startTime, endTime;     __int64 elapsedTime, elapsedMilliseconds;     QueryPerformanceFrequency(&freq);     QueryPerformanceCounter(&startTime);      /* Начало работы */      for (int i = 0; i < OUTER; i++) {         for (int j = 0; j < INNER; j++) {             a |= i;             arr[j] = a;         }     }      /* Конец работы */      QueryPerformanceCounter(&endTime);     elapsedTime = endTime.QuadPart - startTime.QuadPart;     elapsedMilliseconds = (1000 * elapsedTime) / freq.QuadPart;     HeapFree(GetProcessHeap(), 0, arr);     return elapsedMilliseconds; }  int _tmain(int argc, _TCHAR* argv[]) {     __int64 time;     for (int INNER = 0x10; INNER <= 0x2000000; INNER <<= 1) {         int OUTER = TOTAL / INNER;         time = count(INNER, OUTER);         cout << INNER << '\t' << OUTER << '\t' << time << '\n';     } }

That’s what it compiles to (the loop itself):

00401062  xor         ecx,ecx  00401064  test        ebp,ebp  00401066  jle         count+83h (401083h)  00401068  xor         eax,eax  0040106A  test        ebx,ebx  0040106C  jle         count+7Ch (40107Ch)  0040106E  mov         edi,edi  00401070  or          esi,ecx  00401072  mov         dword ptr [edi+eax*4],esi  00401075  add         eax,1  00401078  cmp         eax,ebx  0040107A  jl          count+70h (401070h)  0040107C  add         ecx,1  0040107F  cmp         ecx,ebp  00401081  jl          count+68h (401068h)

That’s what the program outputs on my machine:

LOG2(INNER) LOG2(OUTER)  Time, ms 4           25           523 5           24           569 6           23           441 7           22           400 8           21           367 9           20           358 10          19           349 11          18           364 12          17           378 13          16           384 14          15           357 15          14           377 16          13           379 17          12           390 18          11           386 19          10           419 20          9            995 21          8            1,015 22          7            1,038 23          6            1,071 24          5            1,082 25          4            1,119

I ask people to explain what’s going on. As inner array grows, the number of cycles decreases, as the time does. As inner array outgrows the cache, cache misses begin to happen and time raises. It’s all right so far.

BUT: when the INNER array size is 16 (that gives us 64 bytes of data), there is little performance boost, despite the greater number of jmps in code. It’s little (523 vs. 569), but reproducible.

The question is: why this boost?

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

score 0 · Answer 1 · 2026-05-11T01:57:30+00:00

2026-05-11T01:57:30+00:00Added an answer on May 11, 2026 at 1:57 am

Probably because 64 is the cache line size on your machine, and you basically run each iteration fully out of a single cache line.

0

Reply
Share
Share

- Report

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

Some time ago I wrote a little piece of code to ask about on

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply