How do I efficiently access the elements of a 256-bit vector? For example, I calculated a dot product with
c = _mm256_dp_ps(a, b, 0xff);
How do I access the value in c then? I need both the high part and the low part. Do I understand correctly that I first need to extract the 128-bit parts like this:
r0 = _mm256_extractf128_ps(c,0);
r1 = _mm256_extractf128_ps(c,1);
And only then extract floats:
_MM_EXTRACT_FLOAT(fr0, r0, 0);
_MM_EXTRACT_FLOAT(fr1, r1, 0);
return fr0 + fr1;
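There is no efficient way to do this. _mm256_dp_ps computes a separate dot product within each 128-bit lane, which is why the two halves have to be extracted and added. The dp_ps operation itself is slow, and the subsequent extraction is slow as well. Unless you can process more data in a batch, it is faster to compute the dot product with SSE4 instructions and work with 128-bit vectors than with 256-bit ones.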
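As a rough illustration of the 128-bit route, here is a minimal sketch for two arrays of 8 floats. The function names `dot8_sse41` and `dot8_mul_reduce` are hypothetical, and the `target` attribute is a GCC/Clang convenience (an assumption about the toolchain); with other compilers, remove it and enable SSE4.1/AVX through compiler options. The first function uses `_mm_dp_ps` on each half. The second is a different, commonly used pattern (one 256-bit multiply followed by a 128-bit horizontal sum), included only for comparison. Neither has been benchmarked here.

```c
#include <immintrin.h>

/* Hypothetical helper: 8-element dot product using two SSE4.1 dot products.
   Mask 0xf1: multiply all four element pairs, write the sum to element 0. */
__attribute__((target("sse4.1")))
float dot8_sse41(const float *a, const float *b)
{
    __m128 d0 = _mm_dp_ps(_mm_loadu_ps(a),     _mm_loadu_ps(b),     0xf1);
    __m128 d1 = _mm_dp_ps(_mm_loadu_ps(a + 4), _mm_loadu_ps(b + 4), 0xf1);
    /* Add the two partial results and read element 0 as a scalar. */
    return _mm_cvtss_f32(_mm_add_ss(d0, d1));
}

/* Hypothetical helper: one AVX multiply, then reduce with 128-bit operations. */
__attribute__((target("avx")))
float dot8_mul_reduce(const float *a, const float *b)
{
    __m256 prod = _mm256_mul_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b));

    /* Fold the upper 128-bit lane onto the lower one. */
    __m128 lo  = _mm256_castps256_ps128(prod);    /* reinterprets the low half */
    __m128 hi  = _mm256_extractf128_ps(prod, 1);  /* upper half */
    __m128 sum = _mm_add_ps(lo, hi);              /* four partial sums s0..s3 */

    /* Reduce four partial sums to one. */
    __m128 shuf = _mm_movehdup_ps(sum);           /* (s1, s1, s3, s3) */
    __m128 pair = _mm_add_ps(sum, shuf);          /* element 0: s0+s1, element 2: s2+s3 */
    __m128 high = _mm_movehl_ps(shuf, pair);      /* element 0: s2+s3 */
    return _mm_cvtss_f32(_mm_add_ss(pair, high));
}
```

Both functions take pointers to at least 8 floats and use unaligned loads, so no particular alignment is assumed. Reading the scalar with `_mm_cvtss_f32` is an alternative to `_MM_EXTRACT_FLOAT` when only element 0 is needed.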