Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 3351022
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 18, 20262026-05-18T01:51:30+00:00 2026-05-18T01:51:30+00:00

I am optimizing some code for an Intel x86 Nehalem micro-architecture using SSE intrinsics.

  • 0

I am optimizing some code for an Intel x86 Nehalem micro-architecture using SSE intrinsics.

A portion of my program computes 4 dot products and adds each result to the previous values in a contiguous chunk of an array. More specifically,

tmp0 = _mm_dp_ps(A_0m, B_0m, 0xF1);
tmp1 = _mm_dp_ps(A_1m, B_0m, 0xF2);
tmp2 = _mm_dp_ps(A_2m, B_0m, 0xF4);
tmp3 = _mm_dp_ps(A_3m, B_0m, 0xF8);

tmp0 = _mm_add_ps(tmp0, tmp1);
tmp0 = _mm_add_ps(tmp0, tmp2);
tmp0 = _mm_add_ps(tmp0, tmp3);
tmp0 = _mm_add_ps(tmp0, C_0n);

_mm_storeu_ps(C_2, tmp0);

Notice that I am going about this by using 4 temporary xmm registers to hold the result of each dot product. In each xmm register, the result is placed into a unique 32 bits relative to the other temporary xmm registers such that the end result looks like this:

tmp0= R0-zero-zero-zero

tmp1= zero-R1-zero-zero

tmp2= zero-zero-R2-zero

tmp3= zero-zero-zero-R3

I combine the values contained in each tmp variable into one xmm variable by summing them up with the following instructions:

tmp0 = _mm_add_ps(tmp0, tmp1);
tmp0 = _mm_add_ps(tmp0, tmp2);
tmp0 = _mm_add_ps(tmp0, tmp3);

Finally, I add the register containing all 4 results of the dot products to a contiguous part of an array so that the array’s indexes are incremented by a dot product, like so (C_0n are the 4 values currently in the array that is to be updated; C_2 is the address pointing to these 4 values):

tmp0 = _mm_add_ps(tmp0, C_0n);
_mm_storeu_ps(C_2, tmp0);

I want to know if there is a less round-about, more efficient way to take the results of the dot products and add them to the contiguous chunk of the array. In this way, I am doing 3 additions between registers that only have 1 non-zero value in them. It seems there should be a more effective way to go about this.

I appreciate all help. Thank you.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-18T01:51:30+00:00Added an answer on May 18, 2026 at 1:51 am

    For code like this, I like to store the “transpose” of the A’s and B’s, so that {A_0m.x, A_1m.x, A_2m.x, A_3m.x} are stored in one vector, etc. Then you can do the dot product using just multiplies and adds, and when you’re done, you have all 4 dot products in one vector without any shuffling.

    This is used frequently in raytracing, to test 4 rays at once against a plane (e.g. when traversing a kd-tree). If you don’t have control over the input data, though, the overhead of doing the transpose might not be worth it. The code will also run on pre-SSE4 machines, although that might not be an issue.


    A small efficiency note on the existing code: instead of this

    tmp0 = _mm_add_ps(tmp0, tmp1);
    tmp0 = _mm_add_ps(tmp0, tmp2);
    tmp0 = _mm_add_ps(tmp0, tmp3);
    tmp0 = _mm_add_ps(tmp0, C_0n);
    

    It may be slightly better to do this:

    tmp0 = _mm_add_ps(tmp0, tmp1);  // 0 + 1 -> 0
    tmp2 = _mm_add_ps(tmp2, tmp3);  // 2 + 3 -> 2
    tmp0 = _mm_add_ps(tmp0, tmp2);  // 0 + 2 -> 0
    tmp0 = _mm_add_ps(tmp0, C_0n);
    

    As the first two mm_add_ps‘s are completely independent now. Also, I don’t know the relative timings of adding vs. shuffling, but that might be slightly faster.


    Hope that helps.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

So basically i am struggling with the need of optimizing my code for some
I'm working on some code which contains some (compiler generated) chunks of assembly code
I have a piece of code for some validation logic, which in generalized for
I could really use some help optimizing a table on my website that is
Thanks to some very helpful stackOverflow users at Bit twiddling: which bit is set?
I am having trouble optimizing a relatively simple query involving a GROUP BY, ORDER
I was asked to have a look at a legacy EJB3 application with significant
I was looking up the keyword volatile and what it's for, and the answer
I am trying to use R to estimate a multinomial logit model with a
I'm trying to replicate a bug which the client has reported, it's the this

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.