Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 6786933
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 26, 20262026-05-26T17:18:16+00:00 2026-05-26T17:18:16+00:00

This post is closely related to another one I posted some days ago .

  • 0

This post is closely related to another one I posted some days ago. This time, I wrote a simple code that just adds a pair of arrays of elements, multiplies the result by the values in another array and stores it in a forth array, all variables floating point double precision typed.

I made two versions of that code: one with SSE instructions, using calls to and another one without them I then compiled them with gcc and -O0 optimization level. I write them below:

// SSE VERSION

#define N 10000
#define NTIMES 100000
#include <time.h>
#include <stdio.h>
#include <xmmintrin.h>
#include <pmmintrin.h>

double a[N] __attribute__((aligned(16)));
double b[N] __attribute__((aligned(16)));
double c[N] __attribute__((aligned(16)));
double r[N] __attribute__((aligned(16)));

int main(void){
  int i, times;
  for( times = 0; times < NTIMES; times++ ){
     for( i = 0; i <N; i+= 2){ 
        __m128d mm_a = _mm_load_pd( &a[i] );  
        _mm_prefetch( &a[i+4], _MM_HINT_T0 );
        __m128d mm_b = _mm_load_pd( &b[i] );  
        _mm_prefetch( &b[i+4] , _MM_HINT_T0 );
        __m128d mm_c = _mm_load_pd( &c[i] );
        _mm_prefetch( &c[i+4] , _MM_HINT_T0 );
        __m128d mm_r;
        mm_r = _mm_add_pd( mm_a, mm_b );
        mm_a = _mm_mul_pd( mm_r , mm_c );
        _mm_store_pd( &r[i], mm_a );
      }   
   }
 }

//NO SSE VERSION
//same definitions as before
int main(void){
  int i, times;
   for( times = 0; times < NTIMES; times++ ){
     for( i = 0; i < N; i++ ){
      r[i] = (a[i]+b[i])*c[i];
    }   
  }
}

When compiling them with -O0, gcc makes use of XMM/MMX registers and SSE intstructions, if not specifically given the -mno-sse (and others) options. I inspected the assembly code generated for the second code and I noticed that it makes use of movsd, addsd and mulsd instructions. So it makes use of SSE instructions but only of those that use the lowest part of the registers, if I am not wrong. The assembly code generated for the first C code made use, as expected, of the addp and mulpd instructions, though a pretty larger assembly code was generated.

Anyway, the first code should get better profit, as far as I know, of SIMD paradigm, since every iteration two result values are computed. Still that, the second code performs something such as a 25 per cent faster than the first one. I also made a test with single precision values and get similar results. What’s the reason for that?

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-26T17:18:17+00:00Added an answer on May 26, 2026 at 5:18 pm

    Vectorization in GCC is enabled at -O3. That’s why at -O0, you see only the ordinary scalar SSE2 instructions (movsd, addsd, etc). Using GCC 4.6.1 and your second example:

    #define N 10000
    #define NTIMES 100000
    
    double a[N] __attribute__ ((aligned (16)));
    double b[N] __attribute__ ((aligned (16)));
    double c[N] __attribute__ ((aligned (16)));
    double r[N] __attribute__ ((aligned (16)));
    
    int
    main (void)
    {
      int i, times;
      for (times = 0; times < NTIMES; times++)
        {
          for (i = 0; i < N; ++i)
            r[i] = (a[i] + b[i]) * c[i];
        }
    
      return 0;
    }
    

    and compiling with gcc -S -O3 -msse2 sse.c produces for the inner loop the following instructions, which is pretty good:

    .L3:
        movapd  a(%eax), %xmm0
        addpd   b(%eax), %xmm0
        mulpd   c(%eax), %xmm0
        movapd  %xmm0, r(%eax)
        addl    $16, %eax
        cmpl    $80000, %eax
        jne .L3
    

    As you can see, with the vectorization enabled GCC emits code to perform two loop iterations in parallel. It can be improved, though – this code uses the lower 128 bits of the SSE registers, but it can use the full the 256-bit YMM registers, by enabling the AVX encoding of SSE instructions (if available on the machine). So, compiling the same program with gcc -S -O3 -msse2 -mavx sse.c gives for the inner loop:

    .L3:
        vmovapd a(%eax), %ymm0
        vaddpd  b(%eax), %ymm0, %ymm0
        vmulpd  c(%eax), %ymm0, %ymm0
        vmovapd %ymm0, r(%eax)
        addl    $32, %eax
        cmpl    $80000, %eax
        jne .L3
    

    Note that v in front of each instruction and that instructions use the 256-bit YMM registers, four iterations of the original loop are executed in parallel.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

This post is a follow-up of a related question posted here by Ran .
This post and this post says that with Visual Studio, the run time library
This post relates to a previous that i posted but the question is different
This post is closely related to my previous post: TDBadgedCell keeps caching the BadgeNumber
this post is a long one sorry for that but the problem is complex
This post identifies a feature that I would like to disable. Current numpy behavior:
This post is very similar to my previous one, but the data structures are
This post says that you need a 64 bit system for MongoDB if your
There are several questions throughout this post all related to the title. The overall
This post from F# News states that F# can inline a function passed as

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.