Which one gives better performance?
Example 1
#pragma omp parallel for private(i, j)
for (i = 0; i < 100; i++) {
    for (j = 0; j < 100; j++) {
        ....do sth...
    }
}
Example 2
for (i = 0; i < 100; i++) {
    /* i must stay shared here: a private copy would be uninitialized inside the region */
    #pragma omp parallel for private(j)
    for (j = 0; j < 100; j++) {
        ....do sth...
    }
}
Follow-up question: is it valid to use Example 3?
Example 3
#pragma omp parallel for private(i)
for (i = 0; i < 100; i++) {
    #pragma omp parallel for private(j)
    for (j = 0; j < 100; j++) {
        ....do sth...
    }
}
In general, Example 1 is the best choice because it parallelizes the outermost loop, which minimizes thread fork/join overhead. Although many OpenMP implementations pre-allocate a thread pool, there is still overhead in dispatching logical tasks to the worker threads (a.k.a. a team of threads) and joining them. Also note that with dynamic scheduling (e.g., schedule(dynamic, 1)), this task dispatch overhead can become problematic.
So, Example 2 may incur significant parallel overhead, especially when the trip count of for-i is large (100 is okay, though) and the amount of work in for-j is small. "Small" is an ambiguous term and depends on many variables, but if the inner loop takes less than 1 millisecond, using OpenMP on it is definitely wasteful.
However, if for-i is not parallelizable and only for-j is, then Example 2 is the only option. In this case, you must consider carefully whether the amount of parallel work can offset the parallel overhead.
Example 3 is perfectly valid as long as for-i and for-j are both safely parallelizable (i.e., neither loop has loop-carried flow dependences). Example 3 is called nested parallelism. You may take a look at this article. Nested parallelism should be used with care. In many OpenMP implementations, you need to turn on nested parallelism manually by calling omp_set_nested. However, because nested parallelism may spawn a huge number of threads, its benefit may be significantly reduced.