First method (parallelize inner loop): for(j=0; j<LATTICE_VW; ++j) { x = j*DX + LATTICE_W;

Question

Asked: June 17, 2026

First method (parallelize inner loop):

for(j=0; j<LATTICE_VW; ++j) {
    x = j*DX + LATTICE_W;
    #pragma omp parallel for ordered private(y, prob)
        for(i=0; i<LATTICE_VH; ++i) {
            y = i*DY + LATTICE_S;
            prob = psi[i][j].norm();

            #pragma omp ordered
                out << x << " " << y << " " << prob << endl;
        }
}

Second method (parallelize outer loop):

#pragma omp parallel for ordered private(x, y, prob)
    for(j=0; j<LATTICE_VW; ++j) {
        x = j*DX + LATTICE_W;
        for(i=0; i<LATTICE_VH; ++i) {
            y = i*DY + LATTICE_S;
            prob = psi[i][j].norm();

            #pragma omp ordered
                out << x << " " << y << " " << prob << endl;
        }
    }

Third method (parallelize collapsed loops):

#pragma omp parallel for collapse(2) ordered private(x, y, prob)
    for(j=0; j<LATTICE_VW; ++j) {
        for(i=0; i<LATTICE_VH; ++i) {
            x = j*DX + LATTICE_W;
            y = i*DY + LATTICE_S;
            prob = psi[i][j].norm();

            #pragma omp ordered
                out << x << " " << y << " " << prob << endl;
        }
    }

If I had to guess, I would say that method 3 should be the fastest.

However, method 1 is the fastest, while the second and third both take about as long as the code with no parallelization at all. Why does this happen?

1 Answer

Editorial Team · Answer 1 · 2026-06-17T22:38:09+00:00

Consider this example:

for(int x = 0; x < 4; ++x)
  #pragma omp parallel for ordered
  for(int y = 0; y < 4; ++y)
    #pragma omp ordered
    cout << x << ',' << y << " (by thread " << omp_get_thread_num() << ')' << endl;

With four threads, the output starts like this:

0,0 (by thread 0)
0,1 (by thread 1)
0,2 (by thread 2)
0,3 (by thread 3)
1,0 (by thread 0)
1,1 (by thread 1)
1,2 (by thread 2)
1,3 (by thread 3)

Each thread only has to wait for the preceding iteration's cout; all the work before the ordered region can be done in parallel.
But with:

#pragma omp parallel for ordered
for(int x = 0; x < 4; ++x)
  for(int y = 0; y < 4; ++y)
    #pragma omp ordered
    cout << x << ',' << y << " (by thread " << omp_get_thread_num() << ')' << endl;

and

#pragma omp parallel for collapse(2) ordered
for(int x = 0; x < 4; ++x)
  for(int y = 0; y < 4; ++y)
    #pragma omp ordered
    cout << x << ',' << y << " (by thread " << omp_get_thread_num() << ')' << endl;

the situation is:

0,0 (by thread 0)
0,1 (by thread 0)
0,2 (by thread 0)
0,3 (by thread 0)
1,0 (by thread 1)
1,1 (by thread 1)
1,2 (by thread 1)
1,3 (by thread 1)
2,0 (by thread 2)
2,1 (by thread 2)
2,2 (by thread 2)
2,3 (by thread 2)
3,0 (by thread 3)
3,1 (by thread 3)
3,2 (by thread 3)
3,3 (by thread 3)

So thread 1 has to wait for thread 0 to finish all of its work before it can cout for the first time, and almost nothing can be done in parallel.
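The two outputs differ only in which thread owns which iteration. As a rough sketch, the two assignments can be modelled in plain C++ without OpenMP. The helper names `blocked_owners` and `round_robin_owners` are made up for illustration, and the blocked model assumes the iteration count is a positive multiple of the thread count (real implementations handle remainders in their own way, and the default schedule is implementation-defined).

```cpp
#include <vector>

// Hypothetical helpers, not part of OpenMP: they only model which
// thread owns each iteration of a parallel loop.
// Assumption: n_iters is a positive multiple of n_threads.

// Blocked assignment, as typically seen with schedule(static) and no
// chunk size: each thread gets one contiguous block of iterations.
std::vector<int> blocked_owners(int n_iters, int n_threads) {
    std::vector<int> owner(n_iters);
    const int block = n_iters / n_threads;  // iterations per thread
    for (int i = 0; i < n_iters; ++i) {
        owner[i] = i / block;
    }
    return owner;
}

// Round-robin assignment, as with schedule(static, 1): consecutive
// iterations go to consecutive threads.
std::vector<int> round_robin_owners(int n_iters, int n_threads) {
    std::vector<int> owner(n_iters);
    for (int i = 0; i < n_iters; ++i) {
        owner[i] = i % n_threads;
    }
    return owner;
}
```

With 16 iterations and 4 threads, the blocked model gives thread 1 iterations 4 to 7, so its first ordered region cannot run until thread 0 has finished iterations 0 to 3. In the round-robin model thread 1 owns iteration 1 and only waits for a single cout.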

Try adding schedule(static,1) to the collapse version; it should perform at least as well as the first version does.
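A minimal sketch of that suggestion, assuming a compiler with OpenMP enabled (for example `-fopenmp` with GCC). The function name `render_grid` and the `value` computation are placeholders standing in for the lattice code from the question, not the original program. Without OpenMP the pragmas are ignored and the loop simply runs serially with the same output.

```cpp
#include <sstream>
#include <string>

// Hypothetical stand-in for the question's output loop: returns one
// line per (x, y) pair, always in sequential iteration order.
std::string render_grid(int nx, int ny) {
    std::ostringstream out;

    // schedule(static, 1) hands out collapsed iterations round-robin,
    // so neighbouring iterations belong to different threads.
    #pragma omp parallel for collapse(2) ordered schedule(static, 1)
    for (int x = 0; x < nx; ++x) {
        for (int y = 0; y < ny; ++y) {
            // Placeholder for the per-point work (psi[i][j].norm() in
            // the question); this part can overlap across threads.
            const int value = x * ny + y;

            // Only this statement is serialized, in loop order.
            #pragma omp ordered
            out << x << ',' << y << ' ' << value << '\n';
        }
    }
    return out.str();
}
```

Calling `render_grid(4, 4)` from `main` and printing the result should list the pairs from `0,0` to `3,3` in order regardless of the thread count, since the ordered region enforces the sequence. Whether this beats the inner-loop version depends on how expensive the per-point work is compared with the write.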
