I’m trying to implement a parallel algorithm using tasks within OpenMP.
The parallel programming pattern is based on the producer-consumer idea, but
since the consumer process is slower than the producer, I want to use a few
producers and several consumers.
The main idea is to create as many OS threads as producers and then each of
these will create tasks to be done in parallel (by the consumers). Every
producer will be associated with a proportional number of consumers (i.e
numCheckers/numSeekers).
I’m running the algorithm on an Intel Dual-chip server with 6 cores per chip.
The thing is that when I use only one producer(seeker) and an increasing number
of consumers(checkers) the performance decays very fast as the number of
consumers grows (see table below), even though the correct number of cores are
working at a 100%.
On the other hand, if I increase the number of producers, the average time
decreases or at least stays stable, even with a proportional number of
consumers.
It seems to me that all the improvement is made by the division of the input
among the producers, and the tasks are only bugging. But again, I don’t have any
explanation for the behavior with one producers. Am I missing something in the
OpenMP-Task logic? Am I doing something wrong?
-------------------------------------------------------------------------
| producers | consumers | time |
-------------------------------------------------------------------------
| 1 | 1 | 0.642935 |
| 1 | 2 | 3.004023 |
| 1 | 3 | 5.332524 |
| 1 | 4 | 7.222009 |
| 1 | 5 | 9.472093 |
| 1 | 6 | 10.372389 |
| 1 | 7 | 12.671839 |
| 1 | 8 | 14.631013 |
| 1 | 9 | 14.500603 |
| 1 | 10 | 18.034931 |
| 1 | 11 | 17.835978 |
-------------------------------------------------------------------------
| 2 | 2 | 0.357881 |
| 2 | 4 | 0.361383 |
| 2 | 6 | 0.362556 |
| 2 | 8 | 0.359722 |
| 2 | 10 | 0.358816 |
-------------------------------------------------------------------------
The main section of my code is as fallow:
int main( int argc, char** argv) {
// ... process the input (read from file, etc...)
const char *buffer_start[numSeekers];
int buffer_len[numSeekers];
//populate these arrays dividing the input
//I need to do this because I need to overlap the buffers for
//correctness, so I simple parallel-for it's not enough
//Here is where I create the producers
int num = 0;
#pragma omp parallel for num_threads(numSeekers) reduction(+:num)
for (int i = 0; i < numSeekers; i++) {
num += seek(buffer_start[i], buffer_len[i]);
}
return (int*)num;
}
int seek(const char* buffer, int n){
int num = 0;
//asign the same number of consumers for each producer
#pragma omp parallel num_threads(numCheckers/numSeekers) shared(num)
{
//only one time for every producer
#pragma omp single
{
for(int pos = 0; pos < n; pos += STEP){
if (condition(buffer[pos])){
#pragma omp task shared(num)
{
//check() is a sequential function
num += check(buffer[pos]);
}
}
}
#pragma omp taskwait
}
return num;
}
The observed behaviour is due to the fact that you do not have nested
parallelregions enabled. What happens is that in the first case you are actually experiencing the HUGE overhead of OpenMP tasks. This is most likely due to the fact thatcheck()is not doing enough work compared to the overhead the OpenMP run-time introduces. Why it behaves like so with 1 and with 2 producers?When running with just one producer, the outer
parallelregion is executing with one thread only. Suchparallelregions are inactive according to the OpenMP API specification and they just execute the code inside serially (the only overhead being an additional function call and accessing shared variables via pointers). In this case, the innerparallelregion, although being nested while nested parallelism is disabled, becomes active and spurs a lot of tasks. Tasks introduce relatively high overhead and this overhead increases with the number of threads. With 1 consumer the innerparallelregion is also inactive and hence runs serially without tasks overhead.When running with two producers, the outer
parallelregion is active and hence the innerparallelregion is rendered inactive (remember – no nested parallelism is enabled) and as a consequence, no tasks are being created at all –seek()simply runs serially. There is no tasking overhead and the code runs almost twice as fast than the 1 producer / 1 consumer case. The run time is not dependent on the number of consumers because the innerparallelregion is always inactive, no matter how many threads are specified.How big is the overhead that tasking and concerted access to shared variables introduce? I’ve created a simple synthetic benchmark that executes the following code:
Serially this executes for less than a second on a Westmere CPU with the default optimisation level of GCC 4.7.2. Then I’ve introduced tasks using simple
atomicconstructs to protect the access to the shared variablessum:(no need for
taskwaithere since there is a scheduling point at the implicit barrier at the end of theparallelregion)I’ve also created a more complex variant that performs reduction the same way that Massimiliano has proposed:
Code was compiled with GCC 4.7.2 like:
Running it in batch mode (so no other processes could intervene) on a dual-socket Westmere system (12 cores in total) with a different number of threads and different threads placement on the sockets, one obtains the following run times for both codes:
(run times are given in seconds as measured by
omp_get_wtime(), averaged over 10 runs /std. deviation also shown/;x + yin theConfigurationcolumn meansxthreads on the first socket andythreads on the second socket)As you can see, the overhead from the tasks is enormous. It is much higher than the overhead from using
atomicinstead of applying reduction to thread-private accumulators. Besides, the assignment part of anatomicwith+=compiles to a locked compare-and-exchange instruction (LOCK CMPXCHG) – not much higher overhead than callingomp_get_thread_num()each time.One should also note that the dual-socket Westmere system is NUMA since each CPU has its own memory and accesses to the memory of the other CPU go through the QPI link and are hence with increased latency (and possibly lower bandwidth). As the
ssumvariable is shared in theatomiccase, threads that run on the second processor are essentially making remote requests. Still, the difference between both codes is negligible (except for the marked two-threads case – I have to investigate why) and theatomiccode even starts to outperform the one with the reduction when the number of threads gets higher.On multiscale NUMA systems the synchronisation in the
atomicapproach might become more of a burden, since it adds locking overhead to the already slower remote accesses. One such system is one of our BCS-coupled nodes. BCS (Bull Coherence Switch) is a proprietary solution from Bull that uses XQPI (eXternal QPI) to link together several Nehalem-EX boards into a single system, introducing three levels of NUMA in the way (local memory; remote memory on the same board; remote memory on a remote board). When running on one such system, consisting of 4 boards with 4 octocore Nehalem-EX CPUs each (128 cores in total), theatomicexecutable runs for 1036 s (!!), while the reduction approach runs for 1047 s, i.e. both still execute for about the same time (my previous statement that theatomicapproach is 21,5% slower was due to OS services jitter during the measurement). Both numbers are from single runs and hence are not quite representative. Note that on this system the XQPI link introduces very high latency for inter-board QPI messages and thus locking is very expensive, but not that expensive. Part of the overhead can be taken away by using reduction, but it has to be implemented properly. First, local copies of the reduction variable should be also local to the NUMA node where the thread executes and second, one should find a way to not make calls toomp_get_thread_num(). These two could be achieved in many different ways but the easiest one is just to usethreadprivatevariables:Access to
ssumtneeds no protection as two tasks seldom execute simultaneously in the same thread (have to further investigate if this conforms to the OpenMP specs). This version of the code executes for 972 s. Once again this is not that far from 1036 s and comes from a single measurement only (i.e. it could be simply a statistical fluctuation), but in theory, it should be faster.Lessons to take home:
parallelregions. Nested parallelism is enabled either by setting the environment variableOMP_NESTEDtotrueor by callingomp_set_nested(1);. If enabled, the level of active nesting can be controlled by theOMP_MAX_ACTIVE_LEVELSas pointed out by Massimiliano.