I am optimizing some code for an Intel x86 Nehalem micro-architecture using SSE intrinsics.

Question

Asked: 2026-05-18T01:51:30+00:00

A portion of my program computes 4 dot products and adds each result to the previous values in a contiguous chunk of an array. More specifically,

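tmp0 = _mm_dp_ps(A_0m, B_0m, 0xF1);  // dot(A_0m, B_0m) -> lane 0, other lanes zeroed
tmp1 = _mm_dp_ps(A_1m, B_0m, 0xF2);  // dot(A_1m, B_0m) -> lane 1
tmp2 = _mm_dp_ps(A_2m, B_0m, 0xF4);  // dot(A_2m, B_0m) -> lane 2
tmp3 = _mm_dp_ps(A_3m, B_0m, 0xF8);  // dot(A_3m, B_0m) -> lane 3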

tmp0 = _mm_add_ps(tmp0, tmp1);
tmp0 = _mm_add_ps(tmp0, tmp2);
tmp0 = _mm_add_ps(tmp0, tmp3);
tmp0 = _mm_add_ps(tmp0, C_0n);

_mm_storeu_ps(C_2, tmp0);

Notice that I use 4 temporary xmm registers, one per dot product. The mask passed to each _mm_dp_ps places its result in a different 32-bit lane from the other three temporaries and zeroes the remaining lanes, so the registers end up looking like this (lowest lane first):

tmp0= R0-zero-zero-zero

tmp1= zero-R1-zero-zero

tmp2= zero-zero-R2-zero

tmp3= zero-zero-zero-R3
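
To make the mask values above concrete, here is a scalar C model of how the 8-bit immediate of _mm_dp_ps is interpreted, as I understand Intel's description: the high four bits choose which lane products are summed, and the low four bits choose which output lanes receive the sum. The name dp_ps_model is made up for illustration, and the model adds the products in index order, so its rounding may differ slightly from the hardware instruction.

```c
#include <assert.h>

/* Scalar model of _mm_dp_ps(a, b, imm8); illustrative only.
 * Bits 7..4 of imm8: which lane products a[i]*b[i] enter the sum.
 * Bits 3..0 of imm8: which output lanes receive the sum (others get 0).
 * Assumption: products are added in index order, which may round
 * differently from the real instruction. */
static void dp_ps_model(const float a[4], const float b[4],
                        unsigned imm8, float out[4])
{
    assert(imm8 <= 0xFFu);  /* the immediate is a single byte */

    float sum = 0.0f;
    for (int i = 0; i < 4; ++i) {
        if (imm8 & (0x10u << i)) {
            sum += a[i] * b[i];
        }
    }
    for (int i = 0; i < 4; ++i) {
        out[i] = (imm8 & (1u << i)) ? sum : 0.0f;
    }
}
```

With a mask of 0xF4, for example, all four products are summed and the result lands only in lane 2, which matches the tmp2 layout shown above.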

I combine the values contained in each tmp variable into one xmm variable by summing them up with the following instructions:

tmp0 = _mm_add_ps(tmp0, tmp1);
tmp0 = _mm_add_ps(tmp0, tmp2);
tmp0 = _mm_add_ps(tmp0, tmp3);

Finally, I add the register containing all 4 dot-product results to a contiguous part of an array, so that each of the 4 array elements is incremented by its corresponding dot product (C_0n holds the 4 values currently in the array that are to be updated; C_2 is the address of these 4 values):

tmp0 = _mm_add_ps(tmp0, C_0n);
_mm_storeu_ps(C_2, tmp0);

I want to know if there is a less roundabout, more efficient way to take the results of the dot products and add them to the contiguous chunk of the array. As written, I am doing 3 additions between registers that each hold only 1 non-zero value. It seems there should be a more efficient way to go about this.

I appreciate all help. Thank you.

1 Answer

Editorial Team · Answer 1 · 2026-05-18T01:51:30+00:00

For code like this, I like to store the “transpose” of the A’s and B’s, so that {A_0m.x, A_1m.x, A_2m.x, A_3m.x} are stored in one vector, etc. Then you can do the dot product using just multiplies and adds, and when you’re done, you have all 4 dot products in one vector without any shuffling.

This is used frequently in raytracing, to test 4 rays at once against a plane (e.g. when traversing a kd-tree). If you don’t have control over the input data, though, the overhead of doing the transpose might not be worth it. The code will also run on pre-SSE4 machines, although that might not be an issue.
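
A minimal sketch of the transposed approach, assuming (as in the question's code) that all four dot products share the same B vector, so each component of B can simply be broadcast. The names dot4_transposed and accumulate_dot4 and the a_t layout are hypothetical, not taken from the original program; only SSE1 intrinsics are used.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE1 only: no SSE4.1 dot-product instruction needed */

/* ax = {A_0m.x, A_1m.x, A_2m.x, A_3m.x}, and likewise ay, az, aw.
 * b points to the 4 components of the shared B vector.
 * Returns {dot(A_0m,B), dot(A_1m,B), dot(A_2m,B), dot(A_3m,B)}. */
static __m128 dot4_transposed(__m128 ax, __m128 ay, __m128 az, __m128 aw,
                              const float b[4])
{
    __m128 acc = _mm_mul_ps(ax, _mm_set1_ps(b[0]));
    acc = _mm_add_ps(acc, _mm_mul_ps(ay, _mm_set1_ps(b[1])));
    acc = _mm_add_ps(acc, _mm_mul_ps(az, _mm_set1_ps(b[2])));
    acc = _mm_add_ps(acc, _mm_mul_ps(aw, _mm_set1_ps(b[3])));
    return acc;
}

/* c[0..3] += the four dot products.
 * a_t holds 16 floats in transposed order: all x's, then y's, z's, w's. */
static void accumulate_dot4(float *c, const float *a_t, const float b[4])
{
    assert(c && a_t && b);
    __m128 dots = dot4_transposed(_mm_loadu_ps(a_t),
                                  _mm_loadu_ps(a_t + 4),
                                  _mm_loadu_ps(a_t + 8),
                                  _mm_loadu_ps(a_t + 12), b);
    _mm_storeu_ps(c, _mm_add_ps(_mm_loadu_ps(c), dots));
}
```

All four results come out already packed in one register, so the three combining adds from the question disappear; the cost moves to keeping the A data in transposed form.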

A small efficiency note on the existing code: instead of this

tmp0 = _mm_add_ps(tmp0, tmp1);
tmp0 = _mm_add_ps(tmp0, tmp2);
tmp0 = _mm_add_ps(tmp0, tmp3);
tmp0 = _mm_add_ps(tmp0, C_0n);

It may be slightly better to do this:

tmp0 = _mm_add_ps(tmp0, tmp1);  // 0 + 1 -> 0
tmp2 = _mm_add_ps(tmp2, tmp3);  // 2 + 3 -> 2
tmp0 = _mm_add_ps(tmp0, tmp2);  // 0 + 2 -> 0
tmp0 = _mm_add_ps(tmp0, C_0n);

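The first two _mm_add_ps calls are now completely independent of each other, so they no longer form one serial dependency chain. Also, I don’t know the relative timings of adding vs. shuffling, but combining the results with shuffles instead of adds might be slightly faster.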

Hope that helps.
