Consider the code below, taken from here. For this code, I get the following execution times:
time ./fibomp 40
Number of threads (OpenMP v200805): 2
finonacci(40) = 102334155
real 0m3.193s
user 0m3.180s
sys 0m0.000s
$ export OMP_NUM_THREADS=1
$ time ./fibomp 40
Number of threads (OpenMP v200805): 1
finonacci(40) = 102334155
real 0m3.224s
user 0m3.216s
sys 0m0.000s
So as you can see, there’s not much speed up, definitely not the 2x speedup Ruud mentions in his Tue Nov 01, 2011 1:41 am email. I’m running this on a dual core machine (could that be it?). What am I doing wrong? (BTW, BONUS points, what is the ptime command? Some SPARC Unix command?)
long comp_fib_numbers(int n)
{
long fnm1, fnm2, fn;
if ( n == 0 || n == 1 ) return(n);
// In case the sequence gets too short, execute the serial version
if ( n < 20 )
{
return(comp_fib_numbers(n-1)+comp_fib_numbers(n-2));
}
else
{
#pragma omp task shared(fnm1)
fnm1 = comp_fib_numbers(n-1);
#pragma omp task shared(fnm2)
fnm2 = comp_fib_numbers(n-2);
#pragma omp taskwait
fn = fnm1 + fnm2;
return(fn);
}
}
First of all, just to be sure, since you state that
htopshows that a single core is being used, make sure that you have enabled OpenMP support in your compiler. The option to do so is-fopenmpfor GCC,-xopenmpfor Sun/Oracle compilers and-openmpfor Intel compilers.Second of all,
n = 20might be too low of a cut off for the parallel implementation. A shameless plug – see this course material from a workshop on OpenMP that a colleague of mine gave some months ago. Several parallel versions with tasking are discussed there, starting from slide 20.Third,
ptimeis a Solaris command, not specific to SPARC since it is also available in the x86 version. Many process related Solaris commands have thepprefix in their names. Note that in your casetimeis more likely to be the built-in implementation that Bash provides rather than the standalone binary.Fourth, and may be the real answer to your question – you are missing a
parallelregion in your code so the task directives don’t work at all 🙂 You should rewrite your code as following:You could make the code even more terse by using the
ifclause to switch of the parallel region:If
nhappens to be less than 20, then the parallel region would execute single threaded. Since parallel regions are usually extracted in separate functions, there would still be an additional function call, unless the compiler choses to produce duplicate code. That’s why it is recommended that the serial implementation is extracted in its own function:Edit: Now that I’ve looked at the code that you have linked to, I can see that the call to
comp_fib_numbersis embedded into aparallelregion. So just disregard my comment about the missingparallelregion if you already have one in your code. I will leave it here just for completeness. Try tweaking the value at which the switch between the parallel and the serial version occurs. On modern processors it might be quite high and the example that you have seen is quite old. Also make sure that no dynamic teams are used by either setting the environment variableOMP_DYNAMICtofalse(or toFALSE) or by callingomp_set_dynamic(0);someplace before the parallel region.You haven’t stated what your compiler is but mind that OpenMP 3.0 is supported by GCC since version 4.4, by Intel compilers since version 11.0, by Sun/Oracle compilers since version I_dont_know and is not supported at all by the Visual C/C++ compilers.
Observed speedup on an quad-socket Intel Xeon X7350 system (old pre-Nehalem system with FSB)
With the cut-off set to
25(seems to be the optimal value for the X7350):With the cut-off set to
25and a separate function for the serial implementation:See how the user time decreases by some 400 ms. This is because of the removed overhead.
These were measured with the code from the site that you have linked to. Used compiler is GCC 4.4.6 on a 64-bit Scientific Linux 6.2 system.