Lets consider two almost identical codes:
First
for (int k=0;k<1000;k++)
{
for (int i=0;i<600;i++)
{
for (int j=0;j<600;j++)
{
tab[i][j] = i *j;
}
}
}
Second
for (int k=0;k<1000;k++)
{
for (int i=0;i<600;i++)
{
for (int j=0;j<600;j++)
{
tab[j][i] = i *j;
}
}
}
In the second one instead of tab[i][j] we have tab[j][i].
The first code is much faster.
Question
Why is the first code much faster?
My intuition
Is it because when a program tries to access a cell , firstly whole block that contains this cell is move to the cache, and then it is accessed via cache. Since array in memory is represented by consecutive cells, then in first case then in first case there is much less access to the memory than in the second case.
That’s because of cache locality. Processor cache line can hold several array elements at once but only from the adjacents addresses.
In the first case you have more cache hits – when you iterate over the second array index you access adjacent elements. You access some element, the processor loads it and its neighbours into the cache line, the next adjacent accesses produce cache hits – you no longer need to access memory to deal with them.
In the second case when you iterate over the first index you load some element, a line of cache gets filled, but the next access is to an element that is not in the same line. Thie makes the processor load yet another line into the cache. If the cache can’t hold all lines at once it has to discard previously loaded lines and reload them later. This greatly increases number of memory accesses and thus increases execution time