I am new to SSE and SSE2, and I wrote a small C sample

Question

0

Asked: May 24, 20262026-05-24T17:51:03+00:00 2026-05-24T17:51:03+00:00

I am new to SSE and SSE2, and I wrote a small C sample

0

I am new to SSE and SSE2, and I wrote a small C sample (allocating two counters, one increasing other decreasing than adding the two), which is working as expected. I used intrinsics and Microsoft Visual Studio 10 C++ Express. As second step I wanted to understand what’s going on under the hood, but I’m puzzled now.
For example the assignment operation in the for loops compiles to:

__m128i a_ptr = _mm_load_si128((__m128i*)&(a_aligned[i]));
 mov         eax,dword ptr [i]  
 mov         ecx,dword ptr [a_aligned]  
 movdqa      xmm0,xmmword ptr [ecx+eax*2]  
 movdqa      xmmword ptr [ebp-1C0h],xmm0  
 movdqa      xmm0,xmmword ptr [ebp-1C0h]  
 movdqa      xmmword ptr [a_ptr],xmm0

I understand that the first two lines gets the components of a_aligned’s address, and the third line copies it to the xmm0 register. But I don’t understand why it’s copied back to memory, than to xmm0 again (than to a_ptr). I though that the _mm_load_si128 intrinsic should copy a_aligned[i]’s 128 bits to xmm0 and nothing more. Why is this happened? Am I wrong theoretically? If not how should I hint the compiler? Is my sample code correct (in sense that it doesn’t have unnecessarities)?
Here is my full sample code:

#include <xmmintrin.h>
#include <emmintrin.h>
#include <iostream>

int main(int argc, char *argv[]) {
    unsigned __int16 *a_aligned = (unsigned __int16 *)_mm_malloc(32 * sizeof(unsigned __int16),16);
    unsigned __int16 *b_aligned = (unsigned __int16 *)_mm_malloc(32 * sizeof(unsigned __int16),16);
    unsigned __int16 *c_aligned = (unsigned __int16 *)_mm_malloc(32 * sizeof(unsigned __int16),16);

    for(int i = 0; i < 32; i++) {
        a_aligned[i] = i;
        b_aligned[i] = i;
        c_aligned[i] = 0;
    }

    for(int i = 0; i < 32; i+=8) {
        __m128i a_ptr = _mm_load_si128((__m128i*)&(a_aligned[i]));
        __m128i b_ptr = _mm_load_si128((__m128i*)&(b_aligned[i]));
        __m128i res = _mm_add_epi16(a_ptr, b_ptr);
        _mm_store_si128((__m128i*)&(c_aligned[i]), res);
    }

    for(int i = 1; i < 32; i++) {
        std::cout << c_aligned[i] << " ";
    }

    _mm_free(a_aligned);
    _mm_free(b_aligned);
    _mm_free(c_aligned);
    return 0;
}

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-24T17:51:04+00:00

Editorial Team

2026-05-24T17:51:04+00:00Added an answer on May 24, 2026 at 5:51 pm

Turn on optimization in your compiler settings (use the Release configuration instead of Debug).

0

Reply
Share
Share

- Report

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I am new to SSE and SSE2, and I wrote a small C sample

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply