I wanted to optimize below code using openMP double val; double m_y = 0.0f;

Question

0

Asked: June 18, 20262026-06-18T20:41:03+00:00 2026-06-18T20:41:03+00:00

I wanted to optimize below code using openMP double val; double m_y = 0.0f;

0

I wanted to optimize below code using openMP

double val;
double m_y = 0.0f;
double m_u = 0.0f;
double m_v = 0.0f;

#define _MSE(m, t) \
val = refData[t] - calData[t];  \
m += val*val; 

#pragma omp parallel 
 {
 #pragma omp for
for( i=0; i<(width*height)/2; i++ ) {  //yuv422: 2 pixels at a time
    _MSE(m_u, 0); 
    _MSE(m_y, 1); 
    _MSE(m_v, 2); 
    _MSE(m_y, 3); 

  #pragma omp reduction(+:refData) reduction(+:calData)
    refData += 4;
    calData += 4;
 // int id = omp_get_thread_num();
 //printf("Thread %d performed %d iterations of the loop\n",id ,i);
}

}

Any suggestion welcome for optimizing above code currently I have wrong output.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-18T20:41:04+00:00

I think the easiest thing you can do is allow it to split into 4 threads, and calculate the UYVY errors in each of those. Instead of making them separate values, make them an array:

double sqError[4] = {0};
const int numBytes = width * height * 2;

#pragma omp parallel for
for( int elem = 0; elem < 4; elem++ ) {
    for( int i = elem; i < numBytes; i += 4 ) {
        int val = refData[i] - calData[i];
        sqError[elem] += (double)(val*val);
    }
}

This way, each thread operates exclusively on one thing and there is no contention.

Maybe it’s not the most advanced use of OMP, but you should see a speedup.

After your comment about performance hit, I did some experiments and found that indeed the performance was worse. I suspect this may be due to cache misses.

You said:

performance hit this time with openMP : Time :0.040637 with serial
Time :0.018670

So I reworked it using the reduction on each variable and using a single loop:

    #pragma omp parallel for reduction(+:e0) reduction(+:e1) reduction(+:e2) reduction(+:e3)
    for( int i = 0; i < numBytes; i += 4 ) {
        int val = refData[i] - calData[i];
        e0 += (double)(val*val);
        val = refData[i+1] - calData[i+1];
        e1 += (double)(val*val);
        val = refData[i+2] - calData[i+2];
        e2 += (double)(val*val);
        val = refData[i+3] - calData[i+3];
        e3 += (double)(val*val);
    }

With my test case on a 4-core machine, I observed a little less than 4-fold improvement:

serial:             2025 ms
omp with 2 loops:   6850 ms
omp with reduction: 455  ms

[Edit] On the subject of why the first piece of code performed worse than the non-parallel version, Hristo Iliev said:

Your first piece of code is a terrible example of what false sharing
does in multithreaded codes. As sqError has only 4 elements of 8 bytes
each, it fits in a single cache line (even in a half cache line on
modern x86 CPUs). With 4 threads constantly writing to neighbouring
elements, this would generate a massive amount of inter-core cache
invalidation due to false sharing. One can get around this by using
instead a structure like this struct _error { double val; double
pad[7]; } sqError[4]; Now each sqError[i].val will be in a separate
cache line, hence no false sharing.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I wanted to optimize below code using openMP double val; double m_y = 0.0f;

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply