I just started to use SSE to optimize my code for a computer vision

Question

Asked: May 26, 2026

I just started using SSE to optimize the code for a computer vision project that detects skin color in an image. Below is my function. It takes a color image, looks at each pixel, and returns a probability map. The commented-out code is my original C++ implementation; the rest is the SSE version. I timed both, and weirdly the SSE version isn’t any faster than the original C++ code. Any suggestions about what’s going on, or how to optimize the function further?

void EvalSkinProb(const Mat& cvmColorImg, Mat& cvmProb)
{
    std::clock_t ts = std::clock();  
    Mat cvmHSV = Mat::zeros(cvmColorImg.rows, cvmColorImg.cols, CV_8UC3);
    cvtColor(cvmColorImg, cvmHSV, CV_BGR2HSV);
    std::clock_t te1 = std::clock(); 

    float fFG, fBG;
    double dp;

    __declspec(align(16)) int frgb[4] = {0};
    __declspec(align(16)) int fBase[4] = {g_iLowHue, g_iLowSat, g_iLowVal, 0};
    __declspec(align(16)) int fIndx[4] = {0};
    __m128i* pSrc1 = (__m128i*) frgb;
    __m128i* pSrc2 = (__m128i*) fBase;
    __m128i* pDest = (__m128i*) fIndx;
    __m128i m1;

    for (int y = 0; y < cvmColorImg.rows; y++)
    {
        for (int x = 0; x < cvmColorImg.cols; x++)
        {
            cv::Vec3b hsv = cvmHSV.at<cv::Vec3b>(y, x);
            frgb[0] = hsv[0]; frgb[1] = hsv[1]; frgb[2] = hsv[2];
            m1 = _mm_sub_epi32(*pSrc1, *pSrc2);
            *pDest = _mm_srli_epi32(m1, g_iSValPerbinBit); 

            // c++ code
            //fIndx[0] = ((hsv[0]-g_iLowHue)>>g_iSValPerbinBit);
            //fIndx[1] = ((hsv[1]-g_iLowSat)>>g_iSValPerbinBit);
            //fIndx[2] = ((hsv[2]-g_iLowVal)>>g_iSValPerbinBit);

            fFG = m_cvmSkinHist.at<float>(fIndx[0], fIndx[1], fIndx[2]);
            fBG = m_cvmBGHist.at<float>(fIndx[0], fIndx[1], fIndx[2]);
            dp = (double)fFG/(fBG+fFG);
            cvmProb.at<double>(y, x) = dp;          
        }
    }
    std::clock_t te2 = std::clock();  
    double dSecs1 = (double)(te1-ts)/(CLOCKS_PER_SEC);
    double dSecs2 = (double)(te2-te1)/(CLOCKS_PER_SEC);
}


1 Answer

Editorial Team · Answer 1 · 2026-05-26T21:37:15+00:00

The first problem here is that you’re doing very little SSE work for a tremendous amount of data movement. You’ll spend most of the time packing data into and unpacking it out of the SSE registers for the sake of just two instructions (one subtract and one shift).
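One way to raise the ratio of arithmetic to data movement is to load whole runs of pixel bytes straight from the image and bin many channels per instruction. The following is only a sketch under assumptions: `BinIndicesSSE2` is a hypothetical helper, the input is assumed to be contiguous interleaved H,S,V bytes (as in one row of a CV_8UC3 image), and it uses a saturating subtract, so values below the low bound clamp to bin 0, which differs from the original code.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the question):
// out[i] = max(in[i] - low[i % 3], 0) >> shift for n interleaved H,S,V bytes.
void BinIndicesSSE2(const std::uint8_t* in, std::uint8_t* out, std::size_t n,
                    const std::uint8_t low[3], int shift)
{
    assert(shift >= 0 && shift < 8);

    // The low[0],low[1],low[2] pattern repeats every 3 bytes, so 48 bytes
    // (16 pixels) cover three 16-byte vectors with different phases.
    // This small table is built once per call, outside the hot loop.
    alignas(16) std::uint8_t pat[48];
    for (int k = 0; k < 48; ++k) pat[k] = low[k % 3];
    const __m128i b0 = _mm_load_si128(reinterpret_cast<const __m128i*>(pat));
    const __m128i b1 = _mm_load_si128(reinterpret_cast<const __m128i*>(pat + 16));
    const __m128i b2 = _mm_load_si128(reinterpret_cast<const __m128i*>(pat + 32));

    // SSE2 has no 8-bit shift: shift 16-bit lanes, then mask off the bits
    // that leaked in from the neighbouring byte.
    const __m128i mask = _mm_set1_epi8(static_cast<char>(0xFF >> shift));
    const __m128i cnt  = _mm_cvtsi32_si128(shift);

    std::size_t i = 0;
    for (; i + 48 <= n; i += 48) {
        __m128i v0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in + i));
        __m128i v1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in + i + 16));
        __m128i v2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in + i + 32));
        v0 = _mm_and_si128(_mm_srl_epi16(_mm_subs_epu8(v0, b0), cnt), mask);
        v1 = _mm_and_si128(_mm_srl_epi16(_mm_subs_epu8(v1, b1), cnt), mask);
        v2 = _mm_and_si128(_mm_srl_epi16(_mm_subs_epu8(v2, b2), cnt), mask);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), v0);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i + 16), v1);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i + 32), v2);
    }

    // Scalar tail; i is a multiple of 48 here, so i % 3 keeps the right phase.
    for (; i < n; ++i) {
        const int d = in[i] - low[i % 3];
        out[i] = static_cast<std::uint8_t>((d < 0 ? 0 : d) >> shift);
    }
}
```

In the question's loop this would be called once per row, for example on `cvmHSV.ptr<uchar>(y)` with `n = 3 * cols`, writing into a scratch row of bin indices. The histogram lookups would still be scalar, and they may well dominate the run time, so it is worth measuring before and after.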

Secondly, there is a very subtle performance penalty that will occur in this code.

You are using a buffer to transfer data between variables and SSE registers. This is a BIG NO-NO.

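The reason for this is in the CPU load/store unit. When you write data to a memory location and then immediately read it back with a different access size (here, 32-bit writes into frgb followed by one 128-bit load), the store buffer usually cannot forward the pending data to the load. The load then has to wait until the data has been flushed all the way to cache and re-read, which can incur a penalty of 20+ cycles.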

This is because CPU load/store units are not optimized for this kind of unusual access.
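To illustrate the alternative to a memory buffer, here is a minimal sketch that builds the vector from scalars with intrinsics and pulls the lanes back out of the register the same way. `Bins` and `BinOnePixel` are hypothetical names, and whether the compiler keeps everything in registers is ultimately its decision, so treat this as an illustration rather than a guaranteed fix.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical result type for the three histogram bin indices.
struct Bins { int h, s, v; };

// Hypothetical helper: same subtract-and-shift as the question's loop body,
// but without the frgb/fBase/fIndx arrays in between.
Bins BinOnePixel(int h, int s, int v, int lowH, int lowS, int lowV, int shift)
{
    // _mm_set_epi32 takes lanes from high to low, so lane 0 holds h / lowH.
    const __m128i px   = _mm_set_epi32(0, v, s, h);
    const __m128i base = _mm_set_epi32(0, lowV, lowS, lowH);

    // Logical right shift, as in the question's _mm_srli_epi32 call.
    const __m128i idx = _mm_srli_epi32(_mm_sub_epi32(px, base), shift);

    Bins b;
    b.h = _mm_cvtsi128_si32(idx);                     // lane 0
    b.s = _mm_cvtsi128_si32(_mm_srli_si128(idx, 4));  // lane 1 (byte shift by 4)
    b.v = _mm_cvtsi128_si32(_mm_srli_si128(idx, 8));  // lane 2 (byte shift by 8)
    return b;
}
```

Even written this way, one pixel per call still spends more instructions moving data than computing, so plain scalar code is likely to be just as fast; the version above mainly shows how to avoid the store-then-wider-load pattern.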
