My initial attempt looked like this (suppose we want to multiply a matrix by a vector):
__m128 mat[n]; /* rows */
__m128 vec[n] = {1,1,1,1};
float outvector[n];
for (int row = 0; row < n; row++) {
    for (int k = 3; k < 8; k = k + 4)
    {
        __m128 mrow = mat[k];
        __m128 v = vec[row];
        __m128 sum = _mm_mul_ps(mrow, v);
        sum = _mm_hadd_ps(sum, sum); /* adds adjacent-two floats */
    }
    _mm_store_ss(&outvector[row], _mm_hadd_ps(sum, sum));
}
But this clearly doesn’t work. How do I approach this?
I should load 4 at a time….
The other question is: if my array is very big (say n = 1000), how can I make it 16-byte aligned? Is that even possible?
OK… I’ll use a row-major matrix convention. Each row of [m] requires (2) __m128 elements to yield 8 floats. The 8×1 vector v is a column vector. Since you’re using the haddps instruction, I’ll assume SSE3 is available. The goal is to find r = [m] * v: for each row, multiply the row element-wise by v, then sum the 8 products horizontally to get one element of r.

As for alignment, a variable of type __m128 should be automatically aligned on the stack. With dynamic memory, this is not a safe assumption. Some malloc / new implementations may only return memory guaranteed to be 8-byte aligned.
The intrinsics header provides _mm_malloc and _mm_free. The align parameter should be (16) in this case.
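Putting the two parts together, here is a minimal sketch of r = [m] * v on memory obtained from _mm_malloc. It is one possible approach, not the only one. The helper names matvec_sse3 and alloc_aligned_floats, and the SSE3_FN macro, are my own and purely illustrative. It assumes n is a multiple of 4 and that the matrix is stored row-major as plain floats. SSE3_FN uses a GCC/Clang-specific attribute so the function builds without -msse3; on other compilers it expands to nothing and SSE3 has to be enabled through the compiler options instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <pmmintrin.h> /* SSE3 (_mm_hadd_ps); also pulls in the SSE/SSE2 headers */

/* Assumption: GCC/Clang only. Lets the function below use SSE3 intrinsics
   even when SSE3 is not enabled on the command line. */
#if defined(__GNUC__) || defined(__clang__)
#define SSE3_FN __attribute__((target("sse3")))
#else
#define SSE3_FN
#endif

/* Hypothetical helper: allocate `count` floats on a 16-byte boundary.
   Release the result with _mm_free, not free. */
static float *alloc_aligned_floats(size_t count)
{
    float *p = (float *)_mm_malloc(count * sizeof(float), 16);
    assert(p == NULL || ((uintptr_t)p % 16) == 0);
    return p;
}

/* Hypothetical helper: r = m * v for an n x n row-major matrix.
   Requires n % 4 == 0 and 16-byte aligned m and v, so that every
   4-float chunk of every row starts on a 16-byte boundary. */
SSE3_FN static void matvec_sse3(float *r, const float *m, const float *v, size_t n)
{
    for (size_t row = 0; row < n; row++) {
        __m128 acc = _mm_setzero_ps(); /* four partial sums for this row */
        for (size_t k = 0; k < n; k += 4) {
            __m128 mrow = _mm_load_ps(&m[row * n + k]); /* 4 floats of the row */
            __m128 vk   = _mm_load_ps(&v[k]);           /* matching 4 floats of v */
            acc = _mm_add_ps(acc, _mm_mul_ps(mrow, vk));
        }
        /* Two horizontal adds leave the sum of all four lanes in every lane. */
        acc = _mm_hadd_ps(acc, acc);
        acc = _mm_hadd_ps(acc, acc);
        _mm_store_ss(&r[row], acc); /* write the lowest lane */
    }
}
```

Compared with the attempt in the question, the accumulator lives outside the inner loop, the inner loop walks the columns of the current row in steps of 4, and the horizontal adds happen once per row rather than once per chunk. For n = 1000 the same function applies, since 1000 is a multiple of 4; allocate the matrix with alloc_aligned_floats(n * n) and release it with _mm_free.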