I don’t know how to optimize cache performance at a really low level, thinking about cache-line size or associativity. That’s not something you can learn overnight. Considering my program will run on many different systems and architectures, I don’t think it would be worth it anyway. But still, there are probably some steps I can take to reduce cache misses in general.
Here is a description of my problem:
I have a 3d array of integers, representing values at points in space, like [x][y][z]. Each dimension is the same size, so it’s like a cube. From that I need to make another 3d array, where each value in this new array is a function of 7 parameters: the corresponding value in the original 3d array, plus the 6 indices that “touch” it in space. I’m not worried about the edges and corners of the cube for now.
Here is what I mean in C++ code:
void process3DArray (int input[LENGTH][LENGTH][LENGTH],
int output[LENGTH][LENGTH][LENGTH])
{
for(int i = 1; i < LENGTH-1; i++)
for (int j = 1; j < LENGTH-1; j++)
for (int k = 1; k < LENGTH-1; k++)
//The for loops start at 1 and stop before LENGTH-1
//or other-wise I'll get out-of-bounds errors
//I'm not concerned with the edges and corners of the
//3d array "cube" at the moment.
{
int value = input[i][j][k];
//I am expecting crazy cache misses here:
int posX = input[i+1] [j] [k];
int negX = input[i-1] [j] [k];
int posY = input[i] [j+1] [k];
int negY = input[i] [j-1] [k];
int posZ = input[i] [j] [k+1];
int negZ = input[i] [j] [k-1];
output [i][j][k] =
process(value, posX, negX, posY, negY, posZ, negZ);
}
}
However, it seems like if LENGTH is large enough, I’ll get tons of cache misses when I’m fetching the parameters for process. Is there a cache-friendlier way to do this, or a better way to represent my data other than a 3d array?
And if you have the time to answer these extra questions, do I have to consider the value of LENGTH? Like it’s different whether LENGTH is 20 vs 100 vs 10000. Also, would I have to do something else if I used something other than integers, like maybe a 64-byte struct?
@ ildjarn:
Sorry, I did not think that the code that generates the arrays I am passing into process3DArray mattered. But if it does, I would like to know why.
int main() {
int data[LENGTH][LENGTH][LENGTH];
for(int i = 0; i < LENGTH; i++)
for (int j = 0; j < LENGTH; j++)
for (int k = 0; k < LENGTH; k++)
data[i][j][k] = rand() * (i + j + k);
int result[LENGTH][LENGTH][LENGTH];
process3DArray(data, result);
}
The most important thing you already have right. If you were using Fortran, you’d be doing it exactly wrong, but that’s another story. What you have right is that you are processing in the inner loop along the direction where memory addresses are closest together. A single memory fetch (beyond the cache) will pull in multiple values, corresponding to a series of adjacent values of k. Inside your loop the cache will contain some number of values from i,j; a similar number from i+/-1, j and from i,j+/-1. So you basically have five disjoint sections of memory active. For small values of LENGTH these will only be 1 or three sections of memory. It is in the nature of how caches are built that you can have more than this many disjoint sections of memory in your active set.
I hope process() is small, and inline. Otherwise this may well be insignificant. Also, it will affect whether your code fits in the instruction cache.
Since you’re interested in performance, it is almost always better to initialize five pointers (you only need one for value, posZ and negZ), and then take *(p++) inside the loop.
is asking the compiler to generate 3 adds and two multiplies, unless you have a very good optimizer. If your compiler is particularly lazy about register allocation, you also get four memory accesses; otherwise one.
is asking for one add and a memory reference.