First method (parallelize inner loop):
for(j=0; j<LATTICE_VW; ++j) {
    x = j*DX + LATTICE_W;
    #pragma omp parallel for ordered private(y, prob)
    for(i=0; i<LATTICE_VH; ++i) {
        y = i*DY + LATTICE_S;
        prob = psi[i][j].norm();
        #pragma omp ordered
        out << x << " " << y << " " << prob << endl;
    }
}
Second method (parallelize outer loop):
#pragma omp parallel for ordered private(x, y, prob)
for(j=0; j<LATTICE_VW; ++j) {
    x = j*DX + LATTICE_W;
    for(i=0; i<LATTICE_VH; ++i) {
        y = i*DY + LATTICE_S;
        prob = psi[i][j].norm();
        #pragma omp ordered
        out << x << " " << y << " " << prob << endl;
    }
}
Third method (parallelize collapsed loops):
#pragma omp parallel for collapse(2) ordered private(x, y, prob)
for(j=0; j<LATTICE_VW; ++j) {
    for(i=0; i<LATTICE_VH; ++i) {
        x = j*DX + LATTICE_W;
        y = i*DY + LATTICE_S;
        prob = psi[i][j].norm();
        #pragma omp ordered
        out << x << " " << y << " " << prob << endl;
    }
}
If I had to guess, I would say that method 3 should be the fastest.
However, method 1 is the fastest, while the second and third both take about the same amount of time as if there were no parallelization. Why does this happen?
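Look at the first method, where only the inner loop is parallelized. There, each thread only has to wait for some of the output; all the work before it can be done in parallel.
But with the second method (outer loop parallelized) and the third method (collapsed loops), the situation is different. With the default schedule, which is implementation-defined but typically static, each thread gets one large contiguous block of iterations. So thread 1 has to wait for thread 0 to finish all its work before it can write output for the first time, and nearly nothing can be done in parallel.
Try adding schedule(static,1) to the collapse version, and it should perform at least as well as the first version does.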
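The schedule(static,1) suggestion can be sketched roughly as follows. This is only an illustration under assumptions, not the asker's actual program: the lattice constants are made-up values, prob_at is a hypothetical stand-in for psi[i][j].norm(), and the output goes to a std::ostringstream instead of the original out stream. With a chunk size of 1, consecutive iterations are handed to different threads round-robin, so a thread waiting at the ordered region is only waiting for the iterations immediately before its own.

```cpp
#include <cmath>
#include <sstream>
#include <string>

// Hypothetical stand-ins for the question's lattice constants.
constexpr int LATTICE_VW = 4;        // number of columns (index j)
constexpr int LATTICE_VH = 3;        // number of rows (index i)
constexpr double DX = 0.5;
constexpr double DY = 0.25;
constexpr double LATTICE_W = -1.0;   // x coordinate of column 0
constexpr double LATTICE_S = -1.0;   // y coordinate of row 0

// Placeholder for psi[i][j].norm(); the real computation is not shown
// in the question, so this is just some per-cell work.
double prob_at(int i, int j) {
    return std::exp(-0.1 * (i * i + j * j));
}

// Writes one "x y prob" line per lattice cell, in loop order.
std::string dump_lattice() {
    std::ostringstream out;

    // collapse(2) fuses both loops into one iteration space;
    // schedule(static, 1) deals those iterations out round-robin,
    // one at a time, instead of one contiguous block per thread.
    #pragma omp parallel for collapse(2) ordered schedule(static, 1)
    for (int j = 0; j < LATTICE_VW; ++j) {
        for (int i = 0; i < LATTICE_VH; ++i) {
            // Declared inside the loop body, so each thread has its own
            // copy and no private(...) clause is needed.
            const double x = j * DX + LATTICE_W;
            const double y = i * DY + LATTICE_S;
            const double prob = prob_at(i, j);  // runs in parallel

            // Only this statement is serialized, in iteration order.
            #pragma omp ordered
            out << x << " " << y << " " << prob << "\n";
        }
    }
    return out.str();
}
```

Calling dump_lattice() returns the same text a serial double loop would produce. With GCC or Clang the pragmas only take effect when compiling with -fopenmp; without that flag they are ignored and the loop simply runs serially. Whether the parallel version is actually faster depends on how expensive the per-cell work is compared to the ordered output.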