The Archive Base

Editorial Team
Asked: May 18, 2026 (2026-05-18T02:10:24+00:00)


I am trying to multiply two vectors together, where each element of one vector is multiplied by the element at the same index in the other vector. I then want to sum all the elements of the resulting vector to obtain one number. For instance, the calculation would look like this for the vectors {1,2,3,4} and {5,6,7,8}:

1*5 + 2*6 + 3*7 + 4*8

Essentially, I am taking the dot product of the two vectors. I know there is an SSE instruction to do this, but I could not find an intrinsic function associated with it. At this point, I don't want to write inline assembly in my C code, so I want to use only intrinsic functions. This seems like a common calculation, so I am surprised that I couldn't find the answer on Google.

Note: I am optimizing for a specific microarchitecture which supports up to SSE4.2.


1 Answer

  1. Editorial Team
    Added an answer on May 18, 2026 at 2:10 am (2026-05-18T02:10:25+00:00)

    If you’re doing a dot-product of longer vectors, use multiply and regular _mm_add_ps (or FMA) inside the inner loop. Save the horizontal sum until the end.
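
A minimal sketch of that long-vector pattern, using SSE1 intrinsics only. The names dot_product and hsum_ps are hypothetical, unaligned loads are assumed because nothing is known about the caller's alignment, and a scalar loop handles any leftover elements when n is not a multiple of 4:

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Hypothetical helper: add the four lanes of v into one float. */
static float hsum_ps(__m128 v)
{
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); /* swap adjacent pairs */
    __m128 sums = _mm_add_ps(v, shuf);                           /* { 0+1, 0+1, 2+3, 2+3 } */
    shuf = _mm_movehl_ps(shuf, sums);                            /* low lane = 2+3 */
    sums = _mm_add_ss(sums, shuf);                               /* low lane = 0+1+2+3 */
    return _mm_cvtss_f32(sums);
}

/* Hypothetical name: dot product of two float arrays of length n. */
float dot_product(const float *a, const float *b, size_t n)
{
    __m128 acc = _mm_setzero_ps();  /* four running partial sums */
    size_t i = 0;

    /* Inner loop: vertical multiply and add only, no horizontal work. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }

    /* One horizontal sum after the loop. */
    float result = hsum_ps(acc);

    /* Scalar tail for the remaining 0 to 3 elements. */
    for (; i < n; ++i)
        result += a[i] * b[i];

    return result;
}
```

Note that the partial sums are added in a different order than in a plain scalar loop, so results can differ in the last bits for general floating-point inputs.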


    But if you are doing a dot product of just a single pair of SIMD vectors:

    GCC (at least version 4.3) includes <smmintrin.h> with SSE4.1 level intrinsics, including the single and double-precision dot products:

    __m128  _mm_dp_ps (__m128 __X, __m128 __Y, const int __M);
    __m128d _mm_dp_pd (__m128d __X, __m128d __Y, const int __M);
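
As a rough illustration of the single-precision intrinsic, the sketch below computes the dot product of one pair of 4-float vectors. The function name dot4_dpps is hypothetical. The target attribute is a GCC/Clang extension used here as an assumption so the function builds without extra flags; compiling the file with -msse4.1 instead should work as well.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_dp_ps */

/* Hypothetical name. Computes a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3]. */
__attribute__((target("sse4.1")))  /* GCC/Clang extension; assumption, see above */
float dot4_dpps(const float *a, const float *b)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);

    /* Mask 0xF1: the high nibble (0xF) selects all four lanes for the
       multiply-and-sum, the low nibble (0x1) writes the sum to lane 0
       and zeroes the other lanes. */
    __m128 dp = _mm_dp_ps(va, vb, 0xF1);

    return _mm_cvtss_f32(dp);  /* extract lane 0 */
}
```

For the vectors {1,2,3,4} and {5,6,7,8} from the question, this returns 70. The mask must be a compile-time constant.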
    

    On Intel mainstream CPUs (not Atom/Silvermont) these are somewhat faster than doing it manually with multiple instructions.

    But on AMD (including Ryzen), dpps is significantly slower. (See Agner Fog’s instruction tables)


    As a fallback for older processors without SSE4.1, you can compute the dot product of the vectors a and b with a multiply followed by a horizontal sum. First multiply element-wise:

    __m128 r1 = _mm_mul_ps(a, b);
    

    and then horizontally sum r1 using the method from "Fastest way to do horizontal float vector sum on x86" (see there for a commented version of this, and why it's faster):

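    // Swap adjacent pairs: shuf = { r1[1], r1[0], r1[3], r1[2] }
    __m128 shuf   = _mm_shuffle_ps(r1, r1, _MM_SHUFFLE(2, 3, 0, 1));
    // sums = { 0+1, 0+1, 2+3, 2+3 }
    __m128 sums   = _mm_add_ps(r1, shuf);
    // Bring the high half of sums (2+3) down into the low half of shuf
    shuf          = _mm_movehl_ps(shuf, sums);
    // Lowest element of sums now holds 0+1+2+3
    sums          = _mm_add_ss(sums, shuf);
    float result =  _mm_cvtss_f32(sums);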
    

    A slower alternative uses hadd (SSE3, from <pmmintrin.h>). Each hadd costs 2 shuffles, so this will easily bottleneck on shuffle throughput, especially on Intel CPUs.

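    __m128 r2 = _mm_hadd_ps(r1, r1);  // { 0+1, 2+3, 0+1, 2+3 }
    __m128 r3 = _mm_hadd_ps(r2, r2);  // every lane holds 0+1+2+3
    float result;
    _mm_store_ss(&result, r3);        // store the lowest lane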
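
To make the two fallbacks easier to compare, here is a sketch that wraps each one in a complete function for a single pair of 4-float vectors. The names dot4_shuffle and dot4_hadd are hypothetical, and the target attribute on the hadd version is a GCC/Clang extension assumed here so that it builds without -msse3.

```c
#include <xmmintrin.h>  /* SSE1: mul, shuffle, movehl, add */
#include <pmmintrin.h>  /* SSE3: _mm_hadd_ps */

/* Hypothetical name: multiply, then shuffle-based horizontal sum (SSE1 only). */
float dot4_shuffle(const float *a, const float *b)
{
    __m128 r1   = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    __m128 shuf = _mm_shuffle_ps(r1, r1, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(r1, shuf);   /* { 0+1, 0+1, 2+3, 2+3 } */
    shuf        = _mm_movehl_ps(shuf, sums);
    sums        = _mm_add_ss(sums, shuf); /* low lane = 0+1+2+3 */
    return _mm_cvtss_f32(sums);
}

/* Hypothetical name: multiply, then two hadd steps (needs SSE3). */
__attribute__((target("sse3")))  /* GCC/Clang extension; an assumption */
float dot4_hadd(const float *a, const float *b)
{
    __m128 r1 = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    __m128 r2 = _mm_hadd_ps(r1, r1);      /* { 0+1, 2+3, 0+1, 2+3 } */
    __m128 r3 = _mm_hadd_ps(r2, r2);      /* every lane = 0+1+2+3 */
    return _mm_cvtss_f32(r3);
}
```

Both functions return 70 for the vectors {1,2,3,4} and {5,6,7,8} from the question; which one is faster on a given CPU is something to measure rather than assume.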
    