I have a very weird problem using OpenMP in my C++ code:
void update(double *source, double *target, int n)
{
target[0] = source[0];
target[n-1] = source[n-1];
#pragma omp parallel for
for(int i = 1; i < n-1; ++i)
target[i] = (1.0/3.0) * (source[i-1] + source[i] + source[i+1]);
}
Both source and target are double arrays with n elements. The code works fine when using it without OpenMP. But as soon as I use the pragma, the code seems to get stuck in this loop. The thing is: I have absolutely NO IDEA why. Hope anyone can help me
How large is n?
The default scheduling for a OpenMP
parallel fordirective is implementation specific. It looks like in GOMP (the OpenMP implementation used by gcc), the default is(dynamic,1)according to the documentation here. This means that each thread is accessing (ati-1andi+1) memory locations that are loaded by neighboring threads, which could lead to poor cache utilization. On modern CPU architectures, stencil operations like this are frequently memory-bound and sensitive to caching. You could try specifying a schedule with larger chunks, for instance with:I’m just using 1024 here as an example. In practice, you should experiment to find the optimal chunking factor (or systematically search with a parameter sweep, a process often called “auto-tuning”). Or you could choose a value based more in theory, for instance by deriving it from the L1 or L2 cache size of your CPU.
Or you could instead try static scheduling, since the amount of computation inside the for loop is uniform across threads, and the overhead of the dynamic scheduler may be causing a bottleneck. If you specify
without a chunk size, then each thread will be assigned a single chunk of roughly the same size.
Finally, you may also want to pin the OpenMP threads to their own CPU cores. You can do this using the GOMP_CPU_AFFINITY environment variable.
Edit:
I was just playing around with the following test program compiled with gcc 4.2.1, and I think the documentation I linked to above is incorrect. It looks like GOMP defaults to
schedule(static).And the output with two threads is: