Suppose a general sliding algorithm that executes some function on a kernel, like a mean-filter (average-filter) or the sum of absolute differences algorithm in image processing. As the kernel slides to the next position, there will be some redundant reads from memory because the data enclosed by the new kernel will overlap that of the previous somewhat.
Let me explain with a practical example… Suppose you want to perform a median-filter on a large 2D matrix with a kernel (window) size of 3×3. The first position of the kernel (red in image below) would be centered at (1,1), the second position (green) would be centered at (1,2). Notice how the yellow area is the overlap, and these values now need to be reloaded from memory.
My specific problem is a 3D mean filter so the overlap is even bigger (3^3-3^2 = 18 for 3D vs 3^2-3 = 6 for 2D).
I’m sure this is a common problem… does anyone know how algorithms such as this are implemented efficiently to either eliminate the redundant memory lookups, or to exploit spatial and temporal locality of the CPU cache on modern architectures (e.g. 2-way associative cache)?
My specific problem in 3D takes only the mean from the nearest 6 neighbours (not the diagonal ones) and is implemented in C as follows:
for( i = 0; i <= maxi; i++ ) {
for( j = 0; j <= maxj; j++ ) {
for( k = 0; k <= maxk; k++ ) {
filteredData[ i ][ j ][ k ] =
ONE_SIXTH *
(
data[ i + 1 ][ j ][ k ] +
data[ i - 1 ][ j ][ k ] +
data[ i ][ j + 1 ][ k ] +
data[ i ][ j - 1 ][ k ] +
data[ i ][ j ][ k + 1 ] +
data[ i ][ j ][ k - 1 ]
);
}
}
}

What you are doing is called Convolution. You convolve the multidimensional data with a smaller kernel of the same number of dimensions. It is a very common task, and there are plenty of libraries for it.
A fast solution (depending on the kernel size) is to calculate the convolution in the frequency domain. You calculate the (multidimensional) FFT of both data and kernel, multiply them, and calculate the inverse FFT. You will find libraries optimized to do just that, eg. for Python there is scipy.ndimage.filters.convolve and scipy.signal.fftconvolve.
Tiling is a common image processing technique to optimize low-level memory access. You allocate square tiles (or cubes) that fit well into the CPU cache. When you access neighbouring pixels they will be close together in memory most of the time. Looping over the whole array gets a bit tricky, though.
For further reading I reccommend the paper Why Modern CPUs Are Starving and What Can Be Done about It, which mentions this memory blocking technique, and points to numerical libraries that implement it.
And finally there is the Integral Image that allows you to calculate the average of an arbitrary rectangle/cuboid with just a very small number of memory accesses.