I’m having a look at some code that uses OpenMP, though I’m not too familiar with it (neither the code nor OpenMP).
When running a profiler against it, I see that the program is supposedly spending about 20% of wall-clock time in an “OMP implicit barrier” function.
Is that typical of OpenMP, or does it (maybe) imply that the workload is not distributed evenly among the threads?
Thanks
There are implicit barriers at the end of most OpenMP constructs, such as for (in C/C++) or do (in Fortran), sections, and single (however, there is no barrier at the end of the master construct). The nowait clause can be used to disable these implicit barriers if the algorithm allows the threads to run desynchronised after the worksharing directive. Another implicit barrier is located at the end of each parallel region as part of the fork/join execution model.
You have correctly guessed that a high percentage of implicit barrier wait time usually means that the worksharing is far from optimal. It could be that there are (lots of) large single constructs, or it could be that there are parallel loops (for/do constructs) whose iterations take varying amounts of time.
If the imbalance comes from loops with varying computational time in each iteration (the canonical example is drawing the Mandelbrot set), then the loop schedule can be changed to dynamic using the schedule(dynamic,chunk) clause, where chunk is the chunk size (>= 1). The smaller the chunk size, the better the load balance, but the higher the overhead from the dynamic loop dispatcher. The bigger the chunk size, the lower the overhead, but the more load imbalance appears. The optimal value often depends on the kind of problem and on the hardware, so one has to tweak it to obtain the best performance on the particular system where the code executes.
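To make the two clauses above concrete, here is a minimal sketch in C. It is not taken from your code: the function names (escape_count, mandelbrot, scale_and_offset), the image bounds, and the chunk parameter are all hypothetical and chosen only for illustration. The first loop has very uneven per-row cost, so it uses a dynamic schedule; the second function shows nowait on a loop whose results the following loop does not need.

```c
/* Hypothetical uneven work item: number of Mandelbrot iterations before
   the point (cr, ci) escapes, capped at max_iter. */
int escape_count(double cr, double ci, int max_iter)
{
    double zr = 0.0, zi = 0.0;
    int n = 0;
    while (n < max_iter && zr * zr + zi * zi <= 4.0) {
        double t = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = t;
        ++n;
    }
    return n;
}

/* Fills counts[0 .. width*height). Rows are handed out to threads in
   groups of `chunk` as threads become free, so a thread that gets cheap
   rows picks up more work instead of idling at the barrier. */
void mandelbrot(int *counts, int width, int height, int max_iter, int chunk)
{
    #pragma omp parallel for schedule(dynamic, chunk)
    for (int y = 0; y < height; ++y) {
        double ci = -1.5 + 3.0 * y / height;
        for (int x = 0; x < width; ++x) {
            double cr = -2.0 + 3.0 * x / width;
            counts[y * width + x] = escape_count(cr, ci, max_iter);
        }
    }
}

/* Two independent loops in one parallel region. Assumes a and b do not
   overlap; only then is it safe to drop the barrier after the first loop. */
void scale_and_offset(double *a, double *b, int n)
{
    #pragma omp parallel
    {
        #pragma omp for nowait   /* no implicit barrier after this loop */
        for (int i = 0; i < n; ++i)
            a[i] *= 2.0;

        #pragma omp for          /* implicit barrier kept here */
        for (int i = 0; i < n; ++i)
            b[i] += 1.0;
    }
}
```

With GCC or Clang, OpenMP is enabled with -fopenmp; without that flag the pragmas are ignored and the same code runs serially, which is handy for checking results. A reasonable experiment would be to call mandelbrot with a few different chunk values and watch how the barrier time in your profiler changes.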