Editorial Team
Asked: June 6, 2026

Yes, I read SIMD code runs slower than scalar code . No, it’s not

Yes, I read SIMD code runs slower than scalar code. No, it’s not really a duplicate.

I have been working with 2D math for a while and am in the process of porting my codebase from C to C++. There are a few walls I’ve hit with C that mean I really need polymorphism, but that’s another story. I had considered this a while ago, and the port presented a perfect opportunity to write a 2D vector class, including SSE implementations of the common math operations. Yes, I know there are libraries out there, but I wanted to try it myself to understand what’s going on, and I don’t use anything more complicated than +=.

My implementation uses <immintrin.h>, with a union:

union {
    __m128d ss;      // both components as one SSE register
    struct {
        double x;    // low 64 bits of ss
        double y;    // high 64 bits of ss
    };
};

SSE seemed slow, so I looked at the generated assembly. After fixing a silly pointer mistake, I ended up with the following sets of instructions, each run a billion times in a loop.
(The processor is an AMD Phenom II at 3.7 GHz.)

SSE enabled: 1.1 to 1.8 seconds (varies)

add      $0x1, %eax
addpd    %xmm0, %xmm1
cmp      $0x3b9aca00, %eax
jne      4006c8

SSE disabled: 1.0 seconds (pretty constant)

add      $0x1, %eax
addsd    %xmm0, %xmm3
cmp      $0x3b9aca00, %eax
addsd    %xmm2, %xmm1
jne      400630

The only conclusion I can draw from this is that addsd is faster than addpd, and that pipelining lets the two independent addsd instructions partially overlap, which compensates for the extra instruction.

So my question is: is this worth it, and in practice will it actually help, or should I just not bother with the stupid optimization and let the compiler handle it in scalar mode?

1 Answer

  1. Editorial Team added an answer on June 6, 2026 at 2:41 pm

    This requires more loop unrolling and possibly cache prefetching. Your arithmetic density is very low: one operation for every two memory operations, so you need to pack as many of these into the pipeline as possible.
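
As a rough illustration of the unrolling suggested above, here is a minimal sketch that sums an array of 2D points with four independent accumulators, so consecutive addpd instructions do not have to wait on each other. The function name `sum_points` and the interleaved x0, y0, x1, y1, ... layout are assumptions made for this example and do not come from the question. Whether four accumulators is the right number for a Phenom II would need to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE2: __m128d, _mm_add_pd, _mm_loadu_pd

// Hypothetical helper: sums `count` (x, y) pairs stored contiguously as
// x0, y0, x1, y1, ... and writes the totals to out[0] (x) and out[1] (y).
void sum_points(const double* xy, std::size_t count, double out[2]) {
    assert(xy != nullptr || count == 0);

    // Four independent accumulators: each addpd depends only on its own
    // accumulator, so the CPU can overlap them in the pipeline.
    __m128d acc0 = _mm_setzero_pd();
    __m128d acc1 = _mm_setzero_pd();
    __m128d acc2 = _mm_setzero_pd();
    __m128d acc3 = _mm_setzero_pd();

    std::size_t i = 0;
    // Unrolled body: four points (eight doubles) per iteration.
    for (; i + 4 <= count; i += 4) {
        acc0 = _mm_add_pd(acc0, _mm_loadu_pd(xy + 2 * i));
        acc1 = _mm_add_pd(acc1, _mm_loadu_pd(xy + 2 * i + 2));
        acc2 = _mm_add_pd(acc2, _mm_loadu_pd(xy + 2 * i + 4));
        acc3 = _mm_add_pd(acc3, _mm_loadu_pd(xy + 2 * i + 6));
    }
    // Remainder: fewer than four points left.
    for (; i < count; ++i) {
        acc0 = _mm_add_pd(acc0, _mm_loadu_pd(xy + 2 * i));
    }

    // Combine the partial sums and store x in out[0], y in out[1].
    __m128d total = _mm_add_pd(_mm_add_pd(acc0, acc1),
                               _mm_add_pd(acc2, acc3));
    _mm_storeu_pd(out, total);
}
```

The unaligned load and store are used here only so the sketch works on any input pointer; with 16-byte-aligned data, `_mm_load_pd` and `_mm_store_pd` could be substituted.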

    Also, don’t use a union; use __m128d directly and fill it from your data with _mm_load_pd. Putting __m128d in a union tends to generate bad code in which every element goes through a stack-register-stack round trip, which is detrimental to performance.
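
One way the union-free approach might look is sketched below. The class name `Vec2` and its `load`, `store`, `x`, and `y` members are hypothetical and are not taken from the question's code. The sketch assumes the source data is 16-byte aligned, which `_mm_load_pd` requires.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2: __m128d, _mm_load_pd, _mm_add_pd

// Hypothetical 2D vector that stores only the SSE register type,
// with no union and no separate double members.
class Vec2 {
public:
    // _mm_set_pd takes (high, low), so x ends up in the low lane.
    Vec2(double x, double y) : v_(_mm_set_pd(y, x)) {}

    // Loads x from p[0] and y from p[1]; p must be 16-byte aligned.
    static Vec2 load(const double* p) {
        assert(reinterpret_cast<std::uintptr_t>(p) % 16 == 0);
        return Vec2(_mm_load_pd(p));
    }

    // Writes x to p[0] and y to p[1]; p must be 16-byte aligned.
    void store(double* p) const {
        assert(reinterpret_cast<std::uintptr_t>(p) % 16 == 0);
        _mm_store_pd(p, v_);
    }

    // A single packed add covers both components.
    Vec2& operator+=(const Vec2& other) {
        v_ = _mm_add_pd(v_, other.v_);
        return *this;
    }

    // Component access extracts a lane instead of aliasing memory.
    double x() const { return _mm_cvtsd_f64(v_); }
    double y() const { return _mm_cvtsd_f64(_mm_unpackhi_pd(v_, v_)); }

private:
    explicit Vec2(__m128d v) : v_(v) {}
    __m128d v_;
};
```

With this layout, code such as `Vec2 p = Vec2::load(data); p += q; p.store(data);` keeps the value in a register between operations, and component reads go through `x()` and `y()` only when a scalar is actually needed.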

