I have this loop written in C++, that compiled with MSVC2010 takes a long

Question

Asked: May 15, 2026 (2026-05-15T07:36:08+00:00)

I have this loop written in C++ that takes a long time to run (300 ms) when compiled with MSVC2010:

    for (int i=0; i<h; i++) {
        for (int j=0; j<w; j++) {
            if (buf[i*w+j] > 0) {
                const int sy = max(0, i - hr);
                const int ey = min(h, i + hr + 1);
                const int sx = max(0, j - hr);
                const int ex = min(w, j + hr + 1);
                float val = 0;
                for (int k=sy; k < ey; k++) {
                    for (int m=sx; m < ex; m++) {
                        val += original[k*w + m] * ds[k - i + hr][m - j + hr];
                    }
                }
                heat_map[i*w + j] = val;
            }
        }
    }

That seemed a bit strange to me, so I ran some tests and then rewrote one part in inline assembly, specifically the code that sums “val”:

    for (int i=0; i<h; i++) {
        for (int j=0; j<w; j++) {
            if (buf[i*w+j] > 0) {
                const int sy = max(0, i - hr);
                const int ey = min(h, i + hr + 1);
                const int sx = max(0, j - hr);
                const int ex = min(w, j + hr + 1);
                __asm {
                    fldz
                }
                for (int k=sy; k < ey; k++) {
                    for (int m=sx; m < ex; m++) {
                        float val = original[k*w + m] * ds[k - i + hr][m - j + hr];
                        __asm {
                            fld val
                            fadd
                        }
                    }
                }
                float val1;
                __asm {
                    fstp val1
                }
                heat_map[i*w + j] = val1;
            }
        }
    }

Now it runs in half the time, 150 ms. It does exactly the same thing, so why is it twice as fast? In both cases it was built in Release mode with optimizations on. Am I doing anything wrong in my original C++ code?

1 Answer

Editorial Team · Answer 1 · 2026-05-15T07:36:09+00:00

I suggest you try your original code with the different floating-point models supported by the compiler (precise, strict, or fast; see the /fp option) before drawing any conclusions. I suspect that your original code was compiled with an overly restrictive floating-point model, one that the assembly in your second version does not follow, which is why the original is much slower.

In other words, if the original model was indeed too restrictive, then you were simply comparing apples to oranges. The two versions didn’t really do the same thing, even though it might seem so at first sight.

Note, for example, that in the first version of the code the intermediate sum is accumulated in a float value. If it was compiled with the precise model, the intermediate results would have to be rounded to the precision of the float type, even if the variable val was optimized away and an internal FPU register was used instead. In your assembly code you don’t bother to round the accumulated result, which could have contributed to its better performance.
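To make the point about intermediate rounding concrete, here is a minimal sketch. It is not the code from the question: the function names and the input values are made up for illustration, `volatile` is used only to force a store to a 32-bit float after every addition, and `double` stands in for a wider FPU register.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Round the running sum to float after every addition. `volatile` forces the
// sum to be written to a 32-bit float each step (an illustration device, not
// something the original code does).
float sum_rounded_each_step(const std::vector<float>& values) {
    volatile float acc = 0.0f;
    for (std::size_t i = 0; i < values.size(); ++i) {
        acc = acc + values[i];
    }
    return acc;
}

// Keep the running sum in a wider type and round to float only once at the
// end, similar in spirit to leaving the sum in an FPU register.
float sum_rounded_once(const std::vector<float>& values) {
    double acc = 0.0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        acc += values[i];
    }
    return static_cast<float>(acc);
}

// Hypothetical input: 2^24 followed by four 1.0f values. Above 2^24 a float
// can no longer represent odd integers, so adding 1.0f is lost each time.
std::vector<float> make_demo_values() {
    std::vector<float> v;
    v.push_back(16777216.0f);
    for (int i = 0; i < 4; ++i) {
        v.push_back(1.0f);
    }
    return v;
}

// Self-check that can be called from main(): the two sums differ.
void check_rounding_demo() {
    const std::vector<float> v = make_demo_values();
    assert(sum_rounded_each_step(v) == 16777216.0f);
    assert(sum_rounded_once(v) == 16777220.0f);
}
```

The two functions add the same numbers in the same order and still return different floats, which is the sense in which the C++ loop and the assembly loop may not be doing the same thing.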

I’d suggest you compile both versions of the code in /fp:fast mode and see how their performance compares in that case.
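One way to run that comparison is a small timing harness built once per floating-point model. The sketch below rests on assumptions: the question shows neither the type of `buf` nor of `ds`, so `buf` is taken to hold ints and `ds` is flattened into a single vector; the data is synthetic; and `heat_map_pass`, `time_pass_ms`, and the file name `bench.cpp` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <ctime>
#include <vector>

// Build sketch (bench.cpp is a hypothetical file name):
//   cl /O2 /EHsc /fp:precise bench.cpp
//   cl /O2 /EHsc /fp:fast    bench.cpp

// The loop from the question as a function. Assumptions: `buf` holds ints and
// `ds` is a flattened (2*hr+1) x (2*hr+1) kernel.
void heat_map_pass(const std::vector<int>& buf,
                   const std::vector<float>& original,
                   const std::vector<float>& ds,
                   std::vector<float>& heat_map,
                   int w, int h, int hr) {
    const int side = 2 * hr + 1;
    assert(ds.size() == static_cast<std::size_t>(side) * side);
    for (int i = 0; i < h; i++) {
        for (int j = 0; j < w; j++) {
            if (buf[i*w + j] > 0) {
                const int sy = std::max(0, i - hr);
                const int ey = std::min(h, i + hr + 1);
                const int sx = std::max(0, j - hr);
                const int ex = std::min(w, j + hr + 1);
                float val = 0;
                for (int k = sy; k < ey; k++) {
                    for (int m = sx; m < ex; m++) {
                        val += original[k*w + m] * ds[(k - i + hr)*side + (m - j + hr)];
                    }
                }
                heat_map[i*w + j] = val;
            }
        }
    }
}

// Run the pass `runs` times on synthetic data and return the mean time in
// milliseconds, measured with std::clock. `heat_map` receives the last result.
double time_pass_ms(int w, int h, int hr, int runs, std::vector<float>& heat_map) {
    const std::size_t pixels = static_cast<std::size_t>(w) * h;
    const std::size_t side = static_cast<std::size_t>(2 * hr + 1);
    const std::vector<int> buf(pixels, 1);            // every pixel is processed
    const std::vector<float> original(pixels, 1.0f);  // synthetic image
    const std::vector<float> ds(side * side, 0.5f);   // synthetic kernel
    heat_map.assign(pixels, 0.0f);

    const std::clock_t start = std::clock();
    for (int r = 0; r < runs; r++) {
        heat_map_pass(buf, original, ds, heat_map, w, h, hr);
    }
    const std::clock_t stop = std::clock();
    return 1000.0 * static_cast<double>(stop - start) / CLOCKS_PER_SEC / runs;
}
```

Calling `time_pass_ms` from `main()` with dimensions close to the real image, once per build, gives numbers that can be compared directly. The inline-assembly variant can be dropped into `heat_map_pass` the same way, as long as it is built as 32-bit code, since MSVC does not accept `__asm` blocks in x64 builds.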
