The Intel Advanced Vector Extensions (AVX) offer no dot product instruction in the 256-bit version (YMM registers) for double-precision floating-point values. The "Why?" question has been treated very briefly in another forum (here) and on Stack Overflow (here). The question I am facing is how to replace this missing instruction efficiently with other AVX instructions.
A 256-bit dot product does exist for single-precision floating-point values (reference here):
__m256 _mm256_dp_ps(__m256 m1, __m256 m2, const int mask);
The idea is to find an efficient equivalent for this missing instruction:
__m256d _mm256_dp_pd(__m256d m1, __m256d m2, const int mask);
To be more specific, the code I would like to transform from __m128 (four floats) to __m256d (four doubles) uses the following instructions:
__m128 val0 = ...; // Four float values
__m128 val1 = ...; //
__m128 val2 = ...; //
__m128 val3 = ...; //
__m128 val4 = ...; //
__m128 res = _mm_or_ps( _mm_dp_ps(val1, val0, 0xF1),
             _mm_or_ps( _mm_dp_ps(val2, val0, 0xF2),
             _mm_or_ps( _mm_dp_ps(val3, val0, 0xF4),
                        _mm_dp_ps(val4, val0, 0xF8) )));
The result of this code is a __m128 vector of four floats containing the dot products of val1 and val0, val2 and val0, val3 and val0, and val4 and val0.
Maybe this gives some hints for suggestions?
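For reference, here is a scalar sketch of the double-precision result being asked for. In _mm_dp_ps, the high four bits of the mask select which products enter the sum and the low four bits select which output lanes receive it (the others are zeroed), so 0xF1, 0xF2, 0xF4 and 0xF8 each sum all four products and write lanes 0 to 3 respectively. The name dot4x4_ref and the flat array layout are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical scalar reference: res[i] = dot(val + 4*i, val0) for i = 0..3.
   Assumed layout: val holds val1..val4 back to back (16 doubles). */
void dot4x4_ref(const double *val0, const double *val, double *res)
{
    for (size_t i = 0; i < 4; ++i) {
        double sum = 0.0;
        for (size_t k = 0; k < 4; ++k)
            sum += val[4 * i + k] * val0[k];   /* accumulate one product */
        res[i] = sum;                          /* lane i of the result */
    }
}
```

Any vectorized replacement should agree with this up to floating-point rounding, since the order of the additions may differ.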
I would use a 4*double multiplication, then a hadd (which unfortunately adds only 2*2 doubles, one pair in each of the upper and lower halves), extract the upper half (a shuffle should work equally well, maybe faster) and add it to the lower half. The result is in the low 64 bits of dotproduct.
Edit:
Following an idea from Norbert P., I extended this version to do 4 dot products at one time.
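A minimal sketch of both variants described above: the single dot product (multiply, hadd, extract the upper half, add) and the four-at-once version. The function names dot4_avx and dot4x4_avx, the unaligned loads from plain arrays, and the USE_AVX macro (a GCC/Clang-specific way to enable AVX per function) are assumptions made for illustration. The permute and blend steps are one plausible way to combine the two hadd results.

```c
#include <immintrin.h>

/* Assumption: GCC or Clang. Enables AVX per function when the file is not
   built with -mavx; otherwise the macro expands to nothing. */
#if defined(__GNUC__) && !defined(__AVX__)
#define USE_AVX __attribute__((target("avx")))
#else
#define USE_AVX
#endif

/* Hypothetical helper: dot product of x[0..3] and y[0..3]. */
USE_AVX double dot4_avx(const double *x, const double *y)
{
    __m256d xy = _mm256_mul_pd(_mm256_loadu_pd(x), _mm256_loadu_pd(y));
    /* low to high: xy0+xy1, xy0+xy1, xy2+xy3, xy2+xy3 */
    __m256d temp = _mm256_hadd_pd(xy, xy);
    /* upper half: xy2+xy3 in both lanes */
    __m128d hi128 = _mm256_extractf128_pd(temp, 1);
    __m128d dotproduct = _mm_add_pd(_mm256_castpd256_pd128(temp), hi128);
    /* the result is in the low 64 bits */
    return _mm_cvtsd_f64(dotproduct);
}

/* Hypothetical helper: out[i] = dot(x + 4*i, y + 4*i) for i = 0..3.
   Below, xyij means element j of the product vector for pair i. */
USE_AVX void dot4x4_avx(const double *x, const double *y, double *out)
{
    __m256d xy0 = _mm256_mul_pd(_mm256_loadu_pd(x),      _mm256_loadu_pd(y));
    __m256d xy1 = _mm256_mul_pd(_mm256_loadu_pd(x + 4),  _mm256_loadu_pd(y + 4));
    __m256d xy2 = _mm256_mul_pd(_mm256_loadu_pd(x + 8),  _mm256_loadu_pd(y + 8));
    __m256d xy3 = _mm256_mul_pd(_mm256_loadu_pd(x + 12), _mm256_loadu_pd(y + 12));

    /* low to high: xy00+xy01, xy10+xy11, xy02+xy03, xy12+xy13 */
    __m256d temp01 = _mm256_hadd_pd(xy0, xy1);
    /* low to high: xy20+xy21, xy30+xy31, xy22+xy23, xy32+xy33 */
    __m256d temp23 = _mm256_hadd_pd(xy2, xy3);
    /* low to high: xy02+xy03, xy12+xy13, xy20+xy21, xy30+xy31 */
    __m256d swapped = _mm256_permute2f128_pd(temp01, temp23, 0x21);
    /* low to high: xy00+xy01, xy10+xy11, xy22+xy23, xy32+xy33 */
    __m256d blended = _mm256_blend_pd(temp01, temp23, 0xC);

    /* lane i now holds the complete dot product of pair i */
    _mm256_storeu_pd(out, _mm256_add_pd(swapped, blended));
}
```

For the pattern in the question, x would hold val1 to val4 and every row of y would be val0, all converted to double.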