I wrote simple C++ code that computes the sum of an array as a reduction, but with the OpenMP reduction the program runs slower. There are two variants of the program: one is a plain sum, the other is a sum of a more expensive math function. In the code below, the complex variant is commented out.
#include <iostream>
#include <omp.h>
#include <math.h>
#include <ctime>   // clock(), clock_t, CLOCKS_PER_SEC
using namespace std;

#define N 100000000
#define NUM_THREADS 4

int main() {
    // Fill the array with 0..N-1
    int *arr = new int[N];
    for (int i = 0; i < N; i++) {
        arr[i] = i;
    }
    omp_set_num_threads(NUM_THREADS);
    cout << NUM_THREADS << endl;
    clock_t start = clock();
    int sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        // sum += sqrt(sqrt(arr[i] * arr[i])); // complex variant
        sum += arr[i]; // simple variant
    }
    double diff = ( clock() - start ) / (double)CLOCKS_PER_SEC;
    cout << "Time " << diff << "s" << endl;
    cout << sum << endl;
    delete[] arr;
    return 0;
}
I compile it with both icpc and g++:
icpc reduction.cpp -openmp -o reduction -O3
g++ reduction.cpp -fopenmp -o reduction -O3
Processor: Intel Core 2 Duo T5850, OS: Ubuntu 10.10
Here are the execution times of the simple and complex variants, compiled with and without OpenMP.
Simple variant “sum += arr[i];”:
icpc
0.10s without OpenMP
0.18s with OpenMP
g++
0.11s without OpenMP
0.17s with OpenMP
Complex variant “sum += sqrt(sqrt(arr[i] * arr[i]));”:
icpc
2.92s without OpenMP
3.37s with OpenMP
g++
47.97s without OpenMP
48.20s with OpenMP
In the system monitor I see that 2 cores are busy when the program is built with OpenMP and 1 core is busy without it. I tried several thread counts in OpenMP and got no speedup. I don't understand why the reduction is slow.
The function clock() measures the processor time consumed by the whole process, so the printed time is the sum of the time consumed by all threads. If you want to see wall-clock time (the real time elapsed from beginning to end), use e.g. the times() function on a POSIX system.
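To make the difference visible, here is a minimal sketch that records both timings around the same reduction loop. It swaps in std::chrono::steady_clock as a portable wall-clock timer instead of times(); the names Timing and timed_sum are hypothetical helpers invented for this illustration, and the accumulator is widened to long long (an assumption on my part) so that large totals do not overflow.

```cpp
#include <chrono>
#include <ctime>
#include <vector>

// Hypothetical helper type for this sketch: both timings plus the result.
struct Timing {
    double cpu_seconds;   // from std::clock(): on Linux, CPU time of the whole process
    double wall_seconds;  // from steady_clock: real elapsed time
    long long sum;        // result of the reduction
};

// Hypothetical helper: sums the array and times the loop in two ways.
Timing timed_sum(const std::vector<int>& arr) {
    const std::clock_t cpu_start = std::clock();
    const auto wall_start = std::chrono::steady_clock::now();

    long long sum = 0;  // 64-bit accumulator (assumption, to avoid overflow)
    const long long n = static_cast<long long>(arr.size());
    // The pragma is ignored if the file is compiled without OpenMP enabled.
    #pragma omp parallel for reduction(+:sum)
    for (long long i = 0; i < n; i++) {
        sum += arr[i];
    }

    const auto wall_end = std::chrono::steady_clock::now();
    const std::clock_t cpu_end = std::clock();

    Timing t;
    t.cpu_seconds = static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    t.wall_seconds = std::chrono::duration<double>(wall_end - wall_start).count();
    t.sum = sum;
    return t;
}
```

Calling timed_sum from main() on the array from the question and printing both fields should show cpu_seconds close to wall_seconds multiplied by the number of busy threads when OpenMP is enabled, while wall_seconds is the number to compare between the two builds. With OpenMP enabled, omp_get_wtime() is another way to read wall-clock time.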