I am trying to multiply two vectors together where each element of one vector is multiplied by the element in the same index at the other vector. I then want to sum all the elements of the resulting vector to obtain one number. For instance, the calculation would look like this for the vectors {1,2,3,4} and {5,6,7,8}:
1*5 + 2*6 + 3*7 + 4*8
Essentially, I am taking the dot product of the two vectors. I know there is an SSE instruction to do this, but the instruction doesn't seem to have an intrinsic function associated with it. At this point, I don't want to write inline assembly in my C code, so I want to use only intrinsic functions. This seems like a common calculation, so I am surprised that I couldn't find the answer on Google.
Note: I am optimizing for a specific microarchitecture which supports up to SSE4.2.
If you're doing a dot product of longer vectors, use a multiply and regular `_mm_add_ps` (or FMA) inside the inner loop. Save the horizontal sum until the end.

But if you are doing a dot product of just a single pair of SIMD vectors:

GCC (at least version 4.3) includes `<smmintrin.h>` with SSE4.1-level intrinsics, including the single- and double-precision dot products, `_mm_dp_ps` and `_mm_dp_pd`.

On Intel mainstream CPUs (not Atom/Silvermont) these are somewhat faster than doing it manually with multiple instructions.
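For the single-pair case, here is a minimal sketch of calling the SSE4.1 dot-product intrinsic. The helper name `dot4_dpps` and the `TARGET_SSE41` macro are made up for illustration; the macro assumes GCC or Clang and only exists so the function compiles without passing `-msse4.1` globally. With other compilers, enable SSE4.1 through the compiler's own options instead.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics, including _mm_dp_ps */

/* Assumption: GCC/Clang function-level target attribute. */
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define TARGET_SSE41
#endif

/* Hypothetical helper: dot product of two 4-element float arrays via DPPS. */
TARGET_SSE41
float dot4_dpps(const float *a, const float *b)
{
    __m128 va = _mm_loadu_ps(a);  /* unaligned loads, no alignment assumed */
    __m128 vb = _mm_loadu_ps(b);

    /* Mask 0xF1: the high nibble (0xF) selects all four lanes for the
       multiply, the low nibble (0x1) stores the sum in lane 0 only. */
    __m128 dp = _mm_dp_ps(va, vb, 0xF1);

    return _mm_cvtss_f32(dp);     /* extract lane 0 as a scalar float */
}
```

Calling `dot4_dpps` on `{1,2,3,4}` and `{5,6,7,8}` computes the sum from the question. The mask has to be a compile-time constant.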
But on AMD (including Ryzen), `dpps` is significantly slower. (See Agner Fog's instruction tables.)

As a fallback for older processors, you can use this algorithm to create the dot product of the vectors `a` and `b`: multiply them element-wise into `r1` with `_mm_mul_ps`, and then horizontal sum `r1` using the method from "Fastest way to do horizontal float vector sum on x86" (see there for a commented version of this, and why it's faster).

A slow alternative based on `hadd` costs 2 shuffles per `hadd`, which will easily bottleneck on shuffle throughput, especially on Intel CPUs.
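The fallback path, and the earlier advice to keep the horizontal sum out of the inner loop, can be sketched together as below. This is an illustration rather than a tuned implementation: the function names `hsum_ps_sse1`, `dot4_fallback` and `dot_n` are made up here, the loads are unaligned because nothing is assumed about alignment, and only a single accumulator is used (real code might unroll with several).

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE1 is enough for everything below */

/* Horizontal sum of the four lanes of v using shuffles and adds. */
float hsum_ps_sse1(__m128 v)
{
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); /* [v1, v0, v3, v2] */
    __m128 sums = _mm_add_ps(v, shuf);   /* [v0+v1, v0+v1, v2+v3, v2+v3] */
    shuf = _mm_movehl_ps(shuf, sums);    /* lane 0 now holds v2+v3 */
    sums = _mm_add_ss(sums, shuf);       /* lane 0 = (v0+v1) + (v2+v3) */
    return _mm_cvtss_f32(sums);
}

/* Single pair of vectors: one multiply, then the horizontal sum. */
float dot4_fallback(const float *a, const float *b)
{
    __m128 r1 = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    return hsum_ps_sse1(r1);
}

/* Longer arrays: vertical multiply and add inside the loop,
   one horizontal sum at the very end. */
float dot_n(const float *a, const float *b, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 prod = _mm_mul_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i));
        acc = _mm_add_ps(acc, prod);
    }
    float total = hsum_ps_sse1(acc);
    for (; i < n; ++i)                   /* scalar tail for leftover elements */
        total += a[i] * b[i];
    return total;
}
```

Note that the summation order differs from a plain scalar loop, so results for general floating-point inputs may differ in the last bits.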