I need to read a massive amount of data into a buffer (about 20 gig). I have 192 GB of very fast DDR RAM available, so memory size is not an issue. However, I am finding that the following code runs slower and slower the further it gets into the buffer. The Visual C profiler tells me that 68% of the 12-minute execution time is in the 2 statements inside the loop in doWork(). I am running 64-bit Windows 7 on a very fast Dell with 2 CPUs, 6 physical cores each (24 logical cores), and all 24 cores are completely maxed out while running this.
#include <stdlib.h>

#define ONE_BILLION 1000000000  /* not shown in the original post; assumed value */
#define TREAM_COUNT 9000
#define ARRAY_SIZE ONE_BILLION
#define offSet(a,b,c,d) ( ((size_t) ARRAY_SIZE * (a)) + ((size_t) TREAM_COUNT * 800 * (b)) + ((size_t) 800 * (c)) + (d) )

static short *ptx;  /* file scope so that doWork() can see the buffer */

void doWork(int dogex, int ptxIndex, int carIndex);

void myFunc(int dogex, int ptxIndex, int xtreamIndex, int carIndex)
{
    /* cast so that the element count is computed in 64 bits, not in int */
    ptx = (short *) calloc((size_t) ARRAY_SIZE * 20, sizeof(short));
    #pragma omp parallel for
    for (int bIndex = 0; bIndex < 800; ++bIndex)
        doWork(dogex, ptxIndex, carIndex);
}

void doWork(int dogex, int ptxIndex, int carIndex)
{
    for (int treamIndex = 0; treamIndex < ONE_BILLION; ++treamIndex)
    {
        short ptxValue = ptx[ offSet(dogex, ptxIndex, treamIndex, carIndex) ];
        short lastPtxValue = ptx[ offSet(dogex, ptxIndex-1, treamIndex, carIndex) ];
        // ....
    }
}
The code allocates 20 blocks of one billion short ints. On a 64-bit Windows box, a short int is 2 bytes, so the allocation is ~40 gigabytes.
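To make that arithmetic concrete, here is a minimal sketch. It assumes ONE_BILLION is 1000000000 (the macro is not shown in the question), and the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCKS      20u
#define ONE_BILLION 1000000000u  /* assumed value; not shown in the question */

/* Number of short elements requested, computed in 64 bits. */
static uint64_t element_count(void)
{
    return (uint64_t) BLOCKS * ONE_BILLION;
}

/* Total bytes for that many elements of the given size. */
static uint64_t total_bytes(uint64_t elem_size)
{
    /* guard against 64-bit overflow of the product */
    assert(elem_size != 0 && element_count() <= UINT64_MAX / elem_size);
    return element_count() * elem_size;
}

/* Nonzero if the element count does not fit in a 32-bit int. */
static int count_overflows_int32(void)
{
    return element_count() > (uint64_t) INT32_MAX;
}
```

Under those assumptions, total_bytes(sizeof(short)) is 40,000,000,000 bytes (about 37.25 GiB) when short is 2 bytes. The element count itself (20 billion) does not fit in a 32-bit int, so if ONE_BILLION is a plain int constant, the product ARRAY_SIZE * 20 needs a cast to size_t before the multiplication.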
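You say there are 24 cores and they’re all maxed out. Apart from the omp parallel for pragma around the calls to doWork(), the code as posted doesn’t show how the work is divided between threads: bIndex is never passed to doWork(), so every iteration appears to do the same work. The way in which the code is parallelised could have a profound effect upon performance. You may need to provide more information.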
—
Your basic problem, I suspect, revolves around cache behaviour and memory access limits.
First, with two physical CPUs of six cores each, you will utterly saturate your memory bus. You probably have a NUMA architecture anyway, but nothing in the code controls where calloc() places the memory (e.g. a lot of the data could sit in memory that takes multiple hops to reach).
Hyperthreading is turned on. This effectively halves cache sizes, because the two logical cores on each physical core share its caches. Given that the code is memory-bus bound rather than compute bound, hyperthreading is harmful. (Having said that, if the computation is constantly outside of cache bounds anyway, this won’t change much.)
Since some (much?) of the code has been removed, it’s not clear how the array is being accessed, and the access pattern, and how well that pattern is optimized for the cache, is the key to performance.
What I see in how offSet() is calculated is that the code constantly requires new virtual-to-physical address lookups (TLB misses), each of which requires something like four or five memory accesses. This is killing performance by itself.
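A rough way to see this is to work out the strides implied by the offSet() macro. The sketch below reuses the macro from the question and assumes a 2-byte short, 64-byte cache lines, 4 KiB pages and a buffer that starts on a page boundary. The helper names are hypothetical, and ARRAY_SIZE is given an assumed value because ONE_BILLION is not defined in the post.

```c
#include <assert.h>
#include <stddef.h>

#define TREAM_COUNT 9000
#define ARRAY_SIZE  1000000000ULL  /* assumed value of ONE_BILLION */
#define offSet(a,b,c,d) ( ((size_t) ARRAY_SIZE * (a)) + ((size_t) TREAM_COUNT * 800 * (b)) + ((size_t) 800 * (c)) + (d) )

/* Bytes between the elements read on consecutive treamIndex iterations. */
static size_t inner_stride_bytes(void)
{
    return (offSet(0, 0, 1, 0) - offSet(0, 0, 0, 0)) * sizeof(short);
}

/* Bytes between the ptxIndex and ptxIndex-1 reads in the same iteration. */
static size_t ptx_stride_bytes(void)
{
    return (offSet(0, 1, 0, 0) - offSet(0, 0, 0, 0)) * sizeof(short);
}

/* Distinct units (cache lines or pages) of unit_bytes touched by the
 * first `iterations` reads of one stream, starting at offset 0. */
static size_t units_touched(size_t iterations, size_t unit_bytes)
{
    if (iterations == 0)
        return 0;
    size_t count = 1;
    size_t prev = 0;  /* unit index of the read at treamIndex == 0 */
    for (size_t c = 1; c < iterations; ++c) {
        size_t unit = (offSet(0, 0, c, 0) * sizeof(short)) / unit_bytes;
        if (unit != prev) {
            ++count;
            prev = unit;
        }
    }
    return count;
}
```

With those assumptions, consecutive iterations read elements 1600 bytes apart, so every read lands on a different cache line and only 2 of its 64 bytes are used. A thousand iterations touch 391 separate 4 KiB pages per stream, and the ptxIndex-1 read sits 14,400,000 bytes away from the ptxIndex read, so the two reads never share a page.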
My basic advice would be to break the array up into level 2 cache-sized blocks, give one block to each CPU and let it process that block. You can do that in parallel. Actually, you might be able to use hyperthreading to pre-load the cache, but that’s a more advanced technique.
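As a minimal sketch of that blocking idea, and not the original poster's actual workload: the code below walks a contiguous buffer in blocks and hands each block to one OpenMP thread. The 256 KiB block size is an assumed L2 size per core (check your own CPU), and process_block() is a hypothetical stand-in for the work elided in the question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define BLOCK_BYTES (256u * 1024u)  /* assumed L2 size per core */
#define BLOCK_ELEMS (BLOCK_BYTES / sizeof(short))

/* Hypothetical stand-in for the elided per-element work: sums one block. */
static long long process_block(const short *block, size_t n)
{
    long long sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += block[i];
    return sum;
}

/* Splits data[0..n) into L2-sized blocks; each block is handled by one thread. */
static long long process_all(const short *data, size_t n)
{
    long long total = 0;
    long long nblocks = (long long) ((n + BLOCK_ELEMS - 1) / BLOCK_ELEMS);

    /* Without OpenMP enabled the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for reduction(+:total) schedule(static)
    for (long long b = 0; b < nblocks; ++b) {
        size_t start = (size_t) b * BLOCK_ELEMS;
        size_t len = (n - start < BLOCK_ELEMS) ? (n - start) : BLOCK_ELEMS;
        total += process_block(data + start, len);
    }
    return total;
}
```

This only pays off if the elements a thread needs are contiguous in memory. With the layout in the question, that probably means either reordering the dimensions in offSet() or making the innermost loop run over the index with the smallest stride.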