So I’m having a little trouble figuring out the best way parallelizing these for


Asked: May 26, 2026 (2026-05-26T20:24:01+00:00)


I’m having a little trouble figuring out the best way to parallelize these for loops using OpenMP. I’m guessing that the biggest speedup would come from parallelizing the middle loop, as I do here:

for(i = 0; i < m/16*16; i+=16){
    #pragma omp parallel for
        for(j = 0; j < m; j++){

            C_column_start = C+i+j*m;

            c_1 = _mm_loadu_ps(C_column_start);
            c_2 = _mm_loadu_ps(C_column_start+4);
            c_3 = _mm_loadu_ps(C_column_start+8);
            c_4 = _mm_loadu_ps(C_column_start+12);

            for (k=0; k < n; k+=2){

                A_column_start = A+k*m;

                a_1 = _mm_loadu_ps(A_column_start+i);
                a_2 = _mm_loadu_ps(A_column_start+i+4);
                a_3 = _mm_loadu_ps(A_column_start+i+8);
                a_4 = _mm_loadu_ps(A_column_start+i+12);

                b_1 = _mm_load1_ps(A_column_start+j);

                mul_1 = _mm_mul_ps(a_1, b_1);
                mul_2 = _mm_mul_ps(a_2, b_1);
                mul_3 = _mm_mul_ps(a_3, b_1);
                mul_4 = _mm_mul_ps(a_4, b_1);

                c_4 = _mm_add_ps(c_4, mul_4);
                c_3 = _mm_add_ps(c_3, mul_3);
                c_2 = _mm_add_ps(c_2, mul_2);
                c_1 = _mm_add_ps(c_1, mul_1);

                A_column_start+=m;

                a_1 = _mm_loadu_ps(A_column_start+i);
                a_2 = _mm_loadu_ps(A_column_start+i+4);
                a_3 = _mm_loadu_ps(A_column_start+i+8);
                a_4 = _mm_loadu_ps(A_column_start+i+12);

                b_1 = _mm_load1_ps(A_column_start+j);

                mul_1 = _mm_mul_ps(a_1, b_1);
                mul_2 = _mm_mul_ps(a_2, b_1);
                mul_3 = _mm_mul_ps(a_3, b_1);
                mul_4 = _mm_mul_ps(a_4, b_1);

                c_4 = _mm_add_ps(c_4, mul_4);
                c_3 = _mm_add_ps(c_3, mul_3);
                c_2 = _mm_add_ps(c_2, mul_2);
                c_1 = _mm_add_ps(c_1, mul_1);

            }


            _mm_storeu_ps(C_column_start, c_1);
            _mm_storeu_ps(C_column_start+4, c_2);
            _mm_storeu_ps(C_column_start+8, c_3);
            _mm_storeu_ps(C_column_start+12, c_4);

        }

    }

However, this currently gives me almost no speedup. Any tips would be awesome. I’ve been stuck for quite a while now.


1 Answer

Editorial Team · Answer 1 · 2026-05-26T20:24:01+00:00

First of all, are you sure the loops are parallelizable? What is the range of the value m?

When all three nested loops are parallelizable and m is sufficiently large (say at least 16 or so), parallelizing the outermost loop will be most beneficial. Parallelizing an inner loop may incur severe fork-join overhead from omp parallel for, because the parallel region is re-entered on every iteration of the enclosing loop.
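As a rough sketch of what parallelizing the outermost loop could look like, here is a simplified version of the kernel. It is not the asker's exact code: the function name aat_outer_parallel is made up for illustration, and a plain scalar loop over a 16-element accumulator stands in for the four SSE registers so that the example stays short and portable. It also assumes that the question's temporaries (k, C_column_start, c_1 and so on) were declared outside the loops, in which case they would be shared between threads by default; declaring them inside the loop body, as below, makes them private.

```c
#include <stddef.h>

/* Sketch: C += A * A^T for a column-major m-by-n matrix A (C is m-by-m).
   Only the first m/16*16 rows are handled, as in the question's loop.
   The scalar acc[16] array stands in for the SSE registers c_1..c_4. */
void aat_outer_parallel(int m, int n, const float *A, float *C)
{
    const int m16 = m / 16 * 16;

    /* A single parallel region around the outermost loop:
       the thread team is created once, not once per value of i. */
    #pragma omp parallel for
    for (int i = 0; i < m16; i += 16) {
        for (int j = 0; j < m; j++) {
            /* Declared inside the loop body, so private to each thread. */
            float *C_col = C + i + (size_t)j * m;
            float acc[16];

            for (int r = 0; r < 16; r++)
                acc[r] = C_col[r];

            for (int k = 0; k < n; k++) {
                const float *A_col = A + (size_t)k * m;
                const float b = A_col[j];
                for (int r = 0; r < 16; r++)
                    acc[r] += A_col[i + r] * b;
            }

            for (int r = 0; r < 16; r++)
                C_col[r] = acc[r];
        }
    }
}
```

Each thread owns whole 16-row blocks of C, so no two threads ever write the same element. With GCC or Clang this would typically be built with -fopenmp; without that flag the pragma is ignored and the function simply runs serially.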

For the low speedup, here is a checklist:

Are you sure all cores are utilized? Check with a task manager or a similar tool. Since there is no synchronization other than the implicit barrier at the end of omp parallel for, all cores should be busy.
Is m a huge number? And how long does the computation take? If m is huge while the work per iteration is small, the overhead of omp parallel for may offset the benefit of the parallelization. When you parallelize an inner loop, the number of times the parallel region is entered (here once per outer iteration, so m/16 times) should not be huge.
False sharing could be a reason. If threads frequently write to nearby memory locations that fall on the same cache line, the resulting false sharing can hurt the speedup.
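Two of the checks above can be sketched in code. The hypothetical function below (aat_hoisted_region is not from the question) keeps the work sharing on the middle loop, as in the original, but hoists the parallel region outside the outer loop so that threads are created only once, and it reports how many threads actually took part. As before, a scalar 16-element accumulator stands in for the SSE registers, and the thread-count query is guarded so the code still compiles without OpenMP.

```c
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Sketch: same computation as the question (C += A * A^T, column-major),
   with the parallel region hoisted out of the i loop.
   Returns the number of threads in the team (1 when built without OpenMP). */
int aat_hoisted_region(int m, int n, const float *A, float *C)
{
    const int m16 = m / 16 * 16;
    int threads_used = 1;

    #pragma omp parallel
    {
#ifdef _OPENMP
        /* One thread records the team size; useful for checking core usage. */
        #pragma omp single
        threads_used = omp_get_num_threads();
#endif
        /* Every thread runs the i loop; only the j iterations are divided. */
        for (int i = 0; i < m16; i += 16) {
            #pragma omp for
            for (int j = 0; j < m; j++) {
                float *C_col = C + i + (size_t)j * m;
                float acc[16];

                for (int r = 0; r < 16; r++)
                    acc[r] = C_col[r];

                for (int k = 0; k < n; k++) {
                    const float *A_col = A + (size_t)k * m;
                    const float b = A_col[j];
                    for (int r = 0; r < 16; r++)
                        acc[r] += A_col[i + r] * b;
                }

                for (int r = 0; r < 16; r++)
                    C_col[r] = acc[r];
            }
            /* Implicit barrier here at the end of each omp for. */
        }
    }
    return threads_used;
}
```

If the returned value is 1 on a multi-core machine, the program is probably being built without OpenMP support or run with OMP_NUM_THREADS set to 1. The remaining per-iteration cost is the barrier after each omp for, which is usually cheaper than creating a new parallel region every time.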
