I am trying to write a simple application using OpenMP. Unfortunately, I have a problem with speedup.
In this application I have one while loop. The body of this loop consists of some instructions that must be executed sequentially and one for loop. I use #pragma omp parallel for to make this for loop parallel. The loop doesn't have much work, but it is executed very often.
I prepared two versions of the for loop and ran the application on 1, 2 and 4 cores.
version 1 (4 iterations in for loop): 22 sec, 23 sec, 26 sec.
version 2 (100000 iterations in for loop): 20 sec, 10 sec, 6 sec.
As you can see, when the for loop doesn't have much work, the run time on 2 and 4 cores is higher than on 1 core.
I guess the reason is that #pragma omp parallel for creates new threads in each iteration of the while loop. So I would like to ask: is there any way to create the threads once (before the while loop) and still ensure that some of the work in the while loop is done sequentially?
#include <omp.h>
#include <iostream>
#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
int main(int argc, char* argv[])
{
    double sum = 0;
    while (true)
    {
        // ...
        // some work which should be done sequentially
        // ...
        #pragma omp parallel for num_threads(atoi(argv[1])) reduction(+:sum)
        for(int j=0; j<4; ++j) // version 2: for(int j=0; j<100000; ++j)
        {
            double x = pow(j, 3.0);
            x = sqrt(x);
            x = sin(x);
            x = cos(x);
            x = tan(x);
            sum += x;
            double y = pow(j, 3.0);
            y = sqrt(y);
            y = sin(y);
            y = cos(y);
            y = tan(y);
            sum += y;
            double z = pow(j, 3.0);
            z = sqrt(z);
            z = sin(z);
            z = cos(z);
            z = tan(z);
            sum += z;
        }
        if (sum > 100000000)
        {
            break;
        }
    }
    return 0;
}
You could move the parallel region outside of the while (true) loop and use the single directive to make the serial part of the code execute in one thread only. This will remove the overhead of the fork/join model, because the parallel region is then entered only once instead of on every iteration of the while loop.
Also, OpenMP is not really useful on tight loops with a very small number of iterations (like your version 1). You are basically measuring the OpenMP overhead, since the work inside the loop is done really fast. Even 100000 iterations with transcendental functions take less than a second on a current-generation CPU: at 2 GHz and roughly 100 cycles per FP instruction other than addition, the 15 function calls per iteration add up to about 1.5 × 10^8 cycles, so it'll take ~100 ms.
That's why OpenMP provides the if(condition) clause, which can be used to selectively turn off the parallelisation for small loops. It is also advisable to use schedule(static) for regular loops (that is, for loops in which every iteration takes about the same time to compute).
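Here is a minimal sketch of both suggestions, not a drop-in replacement for your program. The names work, Result, run_hoisted and sum_with_if are hypothetical, the loop body is shortened to one chain of calls, and the threshold passed to if() is an assumption you would have to tune by measuring. The sketch uses only pragmas and no omp.h calls, so it also builds (and then runs serially) when OpenMP is not enabled.

```cpp
#include <cmath>

// Hypothetical stand-in for the loop body from the question (one chain of calls).
double work(int j)
{
    double x = std::pow(static_cast<double>(j), 3.0);
    return std::tan(std::cos(std::sin(std::sqrt(x))));
}

struct Result
{
    double sum;   // accumulated value when the loop stopped
    long rounds;  // number of while-loop iterations executed
};

// Suggestion 1: enter the parallel region once, outside the while loop.
Result run_hoisted(int n, double limit)
{
    double sum = 0.0;  // shared by the team; target of the reduction
    long rounds = 0;   // shared by the team; only modified inside `single`

    #pragma omp parallel
    {
        while (true)
        {
            // Serial part: executed by exactly one thread of the team.
            // The implicit barrier at the end of `single` keeps the other
            // threads from starting the loop below too early.
            #pragma omp single
            {
                ++rounds;
            }

            // Worksharing loop: iterations are split among the existing
            // threads, no new parallel region is started here.
            #pragma omp for schedule(static) reduction(+:sum)
            for (int j = 0; j < n; ++j)
            {
                sum += work(j);
            }
            // Implicit barrier: every thread now sees the same reduced sum,
            // so all threads take the same branch here.
            if (sum > limit)
            {
                break;
            }
        }
    }
    return {sum, rounds};
}

// Suggestion 2: keep `parallel for`, but run it with a single thread when the
// trip count is below a threshold (assumed value, to be tuned by measurement).
double sum_with_if(int n, int threshold)
{
    double sum = 0.0;
    #pragma omp parallel for if(n >= threshold) schedule(static) reduction(+:sum)
    for (int j = 0; j < n; ++j)
    {
        sum += work(j);
    }
    return sum;
}
```

With GCC this would be built with -fopenmp, and the number of threads can be set through the OMP_NUM_THREADS environment variable or a num_threads clause on the parallel directive, as in your original code. A call such as run_hoisted(4, 100000000.0) corresponds to your version 1. Note that the break is only safe because every thread reads the same sum after the barrier; if the serial part also modified sum, that would have to happen inside the single block.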