So I’m having a little trouble figuring out the best way to parallelize these for loops using OpenMP. I’m guessing that the biggest speedup would come from parallelizing the middle loop, like I do here:
for (i = 0; i < m/16*16; i += 16) {
    #pragma omp parallel for
    for (j = 0; j < m; j++) {
        C_column_start = C+i+j*m;
        c_1 = _mm_loadu_ps(C_column_start);
        c_2 = _mm_loadu_ps(C_column_start+4);
        c_3 = _mm_loadu_ps(C_column_start+8);
        c_4 = _mm_loadu_ps(C_column_start+12);
        for (k = 0; k < n; k += 2) {
            A_column_start = A+k*m;
            a_1 = _mm_loadu_ps(A_column_start+i);
            a_2 = _mm_loadu_ps(A_column_start+i+4);
            a_3 = _mm_loadu_ps(A_column_start+i+8);
            a_4 = _mm_loadu_ps(A_column_start+i+12);
            b_1 = _mm_load1_ps(A_column_start+j);
            mul_1 = _mm_mul_ps(a_1, b_1);
            mul_2 = _mm_mul_ps(a_2, b_1);
            mul_3 = _mm_mul_ps(a_3, b_1);
            mul_4 = _mm_mul_ps(a_4, b_1);
            c_4 = _mm_add_ps(c_4, mul_4);
            c_3 = _mm_add_ps(c_3, mul_3);
            c_2 = _mm_add_ps(c_2, mul_2);
            c_1 = _mm_add_ps(c_1, mul_1);
            A_column_start += m;
            a_1 = _mm_loadu_ps(A_column_start+i);
            a_2 = _mm_loadu_ps(A_column_start+i+4);
            a_3 = _mm_loadu_ps(A_column_start+i+8);
            a_4 = _mm_loadu_ps(A_column_start+i+12);
            b_1 = _mm_load1_ps(A_column_start+j);
            mul_1 = _mm_mul_ps(a_1, b_1);
            mul_2 = _mm_mul_ps(a_2, b_1);
            mul_3 = _mm_mul_ps(a_3, b_1);
            mul_4 = _mm_mul_ps(a_4, b_1);
            c_4 = _mm_add_ps(c_4, mul_4);
            c_3 = _mm_add_ps(c_3, mul_3);
            c_2 = _mm_add_ps(c_2, mul_2);
            c_1 = _mm_add_ps(c_1, mul_1);
        }
        _mm_storeu_ps(C_column_start, c_1);
        _mm_storeu_ps(C_column_start+4, c_2);
        _mm_storeu_ps(C_column_start+8, c_3);
        _mm_storeu_ps(C_column_start+12, c_4);
    }
}
However, this is currently giving me almost no speedup. Any tips would be awesome. I’ve been stuck for quite a while now.
First of all, are you sure the loops are parallelizable? What is the range of the value `m`?

When all three nested loops are parallelizable and `m` is sufficiently large (say at least 16 or so), parallelizing the outermost loop will be most beneficial. Parallelizing inner loops may incur severe fork-join overhead from `omp parallel for`.

For the low speedup, here is a checklist:

- Are all cores being utilized? With `omp parallel for`, all cores should be utilized.
- Is `m` a huge number? And what is the computation length? If `m` is huge while the computation is small, the parallel overhead due to `omp parallel for` may offset the benefit of the parallelization. When you parallelize an inner loop, the trip count (e.g., `m`) should not be huge.
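
To make the outermost-loop suggestion concrete, here is a minimal sketch of how the loop nest could be restructured. It rests on a few assumptions: the function name `sgemm_blocked` is hypothetical, the matrices are taken to be column-major with the kernel computing C += A * A^T (which is what the indexing in the question suggests), and a plain `float acc[16]` array stands in for the four SSE registers so the example stays portable. The temporaries are declared inside the parallel loop, which makes them private to each thread. If they are declared outside the loop, as the question's snippet suggests, OpenMP shares them between threads by default, and that is a data race.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 16  /* rows of C handled per outer iteration, as in the question */

/* Hypothetical helper: C += A * A^T for column-major A (m x n) and C (m x m).
   Only rows below m/16*16 are handled, mirroring the loop in the question. */
void sgemm_blocked(int m, int n, const float *A, float *C)
{
    assert(A != NULL && C != NULL);
    const int m_blocked = m / BLOCK * BLOCK;

    /* One parallel region for the whole nest: threads fork and join once.
       Each thread owns distinct i-blocks, so no two threads write the
       same elements of C. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < m_blocked; i += BLOCK) {
        for (int j = 0; j < m; j++) {
            /* Declared inside the parallel loop, so each thread has its own copy. */
            float *C_column_start = C + i + (size_t)j * m;
            float acc[BLOCK];
            for (int r = 0; r < BLOCK; r++)
                acc[r] = C_column_start[r];

            for (int k = 0; k < n; k++) {
                const float *A_column_start = A + (size_t)k * m;
                const float b = A_column_start[j];
                for (int r = 0; r < BLOCK; r++)
                    acc[r] += A_column_start[i + r] * b;
            }

            for (int r = 0; r < BLOCK; r++)
                C_column_start[r] = acc[r];
        }
    }
}
```

Compile with `-fopenmp` on GCC or Clang; without that flag the pragma is ignored and the code runs serially with the same result. In the SSE version, the `acc` array corresponds to `c_1` through `c_4`, and those `__m128` variables would likewise be declared inside the `j` loop. Note that the outer loop only has `m/16` iterations, so this pays off when `m/16` is at least as large as the number of threads.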