This is the first time I am posting a question on stackoverflow, so please

Question 1

This is the first time I am posting a question on stackoverflow, so please try and overlook any errors I may have made in formatting my question/code. But please do point the same out to me so I may be more careful.

I was trying to write some simple intrinsics routines to add two 128-bit vectors (each holding four float values). I found some code on the net and was trying to get it to run on my system. The code is as follows:

 //this is a sample Intrinsics program to add two vectors.

    #include <iostream>  
    #include <iomanip>      
    #include <xmmintrin.h>  
    #include <stdio.h>

    using namespace std;

    struct vector4 {  
        float x, y, z, w;    };   

    //functions to operate on them.  
    vector4 set_vector(float x, float y, float z, float w = 0) {     
        vector4 temp;  
        temp.x = x;   
        temp.y = y;   
        temp.z = z;  
        temp.w = w;  
        return temp;  
    }    


    void print_vector(const vector4& v) {   
        cout << " This is the contents of vector: " << endl;  
        cout << " > vector.x = " << v.x << endl;  
        cout << " vector.y = " << v.y << endl;  
        cout << " vector.z = " << v.z << endl;  
        cout << " vector.w = " << v.w << endl;  
    }

    vector4 sse_vector4_add(const vector4&a, const vector4& b) {  
        vector4 result;  

        asm volatile (  
          "movl $a, %eax" //move operands into registers.  
          "\n\tmovl $b, %ebx"  
          "\n\tmovups  (%eax), xmm0"  //move register contents into SSE registers.  
          "\n\tmovups (%ebx), xmm1"  
          "\n\taddps xmm0, xmm1" //add the elements. addps operates on single-precision vectors.    
          "\n\t movups xmm0, result" //move result into vector4 type data.  
        );
        return result;  
    }

    int main() {     
        vector4 a, b, result;  
        a = set_vector(1.1, 2.1, 3.2, 4.5);   
        b = set_vector(2.2, 4.2, 5.6);    
        result = sse_vector4_add(a, b);    
        print_vector(a);  
        print_vector(b);    
        print_vector(result);
        return 0;
    }

The g++ parameters I use are:

g++ -Wall -pedantic -g -march=i386 -msse intrinsics_SSE_example.C -o h

The errors I get are as follows:

intrinsics_SSE_example.C: Assembler messages:  
intrinsics_SSE_example.C:45: Error: too many memory references for movups  
intrinsics_SSE_example.C:46: Error: too many memory references for movups  
intrinsics_SSE_example.C:47: Error: too many memory references for addps  
intrinsics_SSE_example.C:48: Error: too many memory references for movups

I have spent a lot of time on trying to debug these errors, googled them and so on. I am a complete noob to Intrinsics and so may have overlooked some important things.

Any help is appreciated,
Thanks,
Sriram.

Question 2

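You're using inline assembly (asm) blocks, not intrinsics.

Since those xmmX are registers, you should prefix them with a %:

      "\n\tmovups  (%eax), %xmm0"
      // etc.

Without the prefix, the assembler treats xmm0 as a symbol name, that is, as a memory operand, which is why it reports "too many memory references".

Your assembly also has several other errors:

1. You should not modify the ebx register without telling the compiler. It is a callee-saved register, and the original block never declares that it overwrites it.
2. $a etc. is treated by the assembler as a global symbol, which it is not. a, b and result are C++ references and locals, so the assembler cannot see them by name.
3. addps %xmm0, %xmm1 stores the result in xmm1, but the code then copies xmm0 out. Remember that in AT&T syntax the destination is on the right.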

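The corrected asm block (for 32-bit x86, as in your build) would look like this. In extended asm, register names are written with %% because a single % introduces an operand:

    asm volatile (
      "movl %1, %%eax"                 // eax = &a
      "\n\tmovl %2, %%ecx"             // ecx = &b
      "\n\tmovups (%%eax), %%xmm0"     // xmm0 = a
      "\n\tmovups (%%ecx), %%xmm1"     // xmm1 = b
      "\n\taddps %%xmm1, %%xmm0"       // xmm0 += xmm1
      "\n\tmovups %%xmm0, %0"          // result = xmm0
      : "=m"(result)
      : "r"(&a), "r"(&b)
      : "eax", "ecx", "xmm0", "xmm1", "memory");

Basically, %0 will be replaced by a memory operand that refers to result, and %1 and %2 will be replaced by registers holding &a and &b. See http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html for a detailed explanation. The clobber list tells the compiler which registers the block overwrites: listing "eax" and "ecx" prevents those two registers from being chosen for %1 and %2, and "xmm0" and "xmm1" are listed because the block changes them. "memory" is there because the block reads a and b through pointers rather than through declared memory operands.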

But the first two movl instructions are unnecessary, because %1 and %2 are already registers holding the addresses:

    asm volatile(
      "movups (%1), %%xmm0"
      "\n\tmovups (%2), %%xmm1"
      "\n\taddps %%xmm1, %%xmm0"
      "\n\tmovups %%xmm0, %0"
      : "=m"(result)
      : "r"(&a), "r"(&b)
      : "xmm0", "xmm1", "memory");
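
Going one step further, the hard-coded xmm0/xmm1 can be dropped too by letting the compiler pick the SSE registers through the "x" constraint. This is only a sketch, assuming GCC or Clang on an x86 target with SSE enabled. The name add_ps_asm is made up for illustration, and the scalar fallback is there only so the snippet still builds on other targets.

```cpp
#if defined(__GNUC__) && defined(__SSE__)
#include <xmmintrin.h>
#endif

struct vector4 {
    float x, y, z, w;
};

// The loads/stores below assume vector4 is four contiguous floats.
static_assert(sizeof(vector4) == 4 * sizeof(float), "vector4 must be packed");

// Hypothetical helper (not from the original post): returns a + b.
vector4 add_ps_asm(const vector4& a, const vector4& b) {
    vector4 result;
#if defined(__GNUC__) && defined(__SSE__)
    // Unaligned loads, since vector4 is not guaranteed to be 16-byte aligned.
    __m128 va = _mm_loadu_ps(reinterpret_cast<const float*>(&a));
    __m128 vb = _mm_loadu_ps(reinterpret_cast<const float*>(&b));
    // AT&T order: source first, destination last, so this computes va += vb.
    // "+x" = read/write SSE register, "x" = input SSE register.
    // The compiler chooses the registers, so no clobber list is needed.
    asm("addps %1, %0" : "+x"(va) : "x"(vb));
    _mm_storeu_ps(reinterpret_cast<float*>(&result), va);
#else
    // Scalar fallback for targets without SSE or GNU-style inline asm.
    result.x = a.x + b.x;
    result.y = a.y + b.y;
    result.z = a.z + b.z;
    result.w = a.w + b.w;
#endif
    return result;
}
```

Because every value the instruction touches is declared as an operand, the compiler knows exactly what is read and written, and neither volatile nor a "memory" clobber is required here.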

Since you mentioned intrinsics, why not use __builtin_ia32_addps, or the _mm_add_ps wrapper from <xmmintrin.h>, which your code already includes?
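
As a rough sketch of what the intrinsics route could look like, here is sse_vector4_add rewritten with the <xmmintrin.h> intrinsics _mm_loadu_ps, _mm_add_ps and _mm_storeu_ps. It assumes a compiler that defines __SSE__ when SSE is enabled (GCC and Clang do). The scalar branch is an added fallback so the snippet builds elsewhere, not part of the original code.

```cpp
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

struct vector4 {
    float x, y, z, w;
};

// The loads/stores below assume vector4 is four contiguous floats.
static_assert(sizeof(vector4) == 4 * sizeof(float), "vector4 must be packed");

vector4 sse_vector4_add(const vector4& a, const vector4& b) {
    vector4 result;
#if defined(__SSE__)
    // movups equivalents: unaligned loads of four floats each.
    __m128 va = _mm_loadu_ps(reinterpret_cast<const float*>(&a));
    __m128 vb = _mm_loadu_ps(reinterpret_cast<const float*>(&b));
    // addps equivalent: element-wise single-precision addition.
    __m128 sum = _mm_add_ps(va, vb);
    // Unaligned store of the four sums into result.
    _mm_storeu_ps(reinterpret_cast<float*>(&result), sum);
#else
    // Scalar fallback when SSE is not available.
    result.x = a.x + b.x;
    result.y = a.y + b.y;
    result.z = a.z + b.z;
    result.w = a.w + b.w;
#endif
    return result;
}
```

This keeps the same signature as the function in the question, so the existing main can call it unchanged, and the compiler handles register allocation and operand order.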

Editorial Team · Answer 1 · 2026-05-15T01:24:04+00:00

