Assume I have two vectors represented by two arrays of type double, each of size 2. I’d like to add corresponding positions. So given vectors i0 and i1, I’d like to compute i0[0] + i1[0] and i0[1] + i1[1].
Since the type is double, I would need two registers. The trick would be to put i0[0] and i1[0] in one register, and i0[1] and i1[1] in another, and just add each register with itself.
My question is, if I call _mm_load_ps(i0[0]) and then _mm_load_ps(i1[0]), will that place them in the lower and upper 64 bits separately, or will the second load replace the register's contents? How would I place both doubles in the same register, so I can call add_ps afterwards?
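I think what you want is to load each array into its own register with _mm_load_pd and then add the two registers with _mm_add_pd.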
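A minimal sketch of the idea, wrapped in a hypothetical helper add_pairs (the helper name, the out parameter, and the final store are illustrative additions). It assumes SSE2 is available and that all three arrays are 16-byte aligned, which _mm_load_pd and _mm_store_pd require:

```c
#include <emmintrin.h>  /* SSE2: __m128d, _mm_load_pd, _mm_add_pd, _mm_store_pd */

/* Hypothetical helper: out[k] = i0[k] + i1[k] for k = 0, 1.
   Assumption: i0, i1 and out are all 16-byte aligned. */
static void add_pairs(const double *i0, const double *i1, double *out)
{
    __m128d x1  = _mm_load_pd(i0);     /* low 64 bits = i0[0], high 64 bits = i0[1] */
    __m128d x2  = _mm_load_pd(i1);     /* low 64 bits = i1[0], high 64 bits = i1[1] */
    __m128d sum = _mm_add_pd(x1, x2);  /* lane-wise (vertical) add */
    _mm_store_pd(out, sum);            /* write both sums back to memory */
}
```

If alignment cannot be guaranteed, _mm_loadu_pd and _mm_storeu_pd are the unaligned counterparts.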
When you do a _mm_load_pd, it puts the first double into the lower 64 bits of the register and the second into the upper 64 bits. So if x1 is loaded from i0 and x2 from i1, x1 holds the two double values i0[0] and i0[1] (and similarly for x2). The call to _mm_add_pd vertically adds the corresponding elements in x1 and x2, so after the addition, sum holds i0[0] + i1[0] in its lower 64 bits and i0[1] + i1[1] in its upper 64 bits.

Edit: I should point out that there is no benefit to using _mm_load_pd instead of _mm_load_ps. As the function names indicate, the pd variety explicitly loads two packed doubles and the ps version loads four packed single-precision floats. Since these are purely bit-for-bit memory moves and they both use the SSE floating-point unit, there is no penalty to using _mm_load_ps to load in double data. And there is a benefit to _mm_load_ps: its instruction encoding is one byte shorter than that of _mm_load_pd, so it is more efficient from an instruction cache standpoint (and potentially for instruction decoding; I’m not an expert on all of the intricacies of modern x86 processors).

To use _mm_load_ps here, cast the double pointer to float * for the load and cast the loaded register back to __m128d. There is no function implied by the casts; they simply make the compiler reinterpret the SSE register’s contents as holding doubles instead of floats so that it can be passed into the double-precision arithmetic function _mm_add_pd.
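A sketch of the _mm_load_ps variant, under the same assumptions (SSE2, 16-byte-aligned arrays). The helper name add_pairs_ps is hypothetical, and using the _mm_castps_pd intrinsic for the register reinterpretation is an assumption about how the casts are best spelled; it generates no instructions and is more portable than a C-style cast between vector types:

```c
#include <emmintrin.h>  /* SSE2; also provides the SSE _mm_load_ps intrinsic */

/* Hypothetical helper: same result as loading with _mm_load_pd,
   but the loads are issued as single-precision loads.
   Assumption: i0, i1 and out are all 16-byte aligned. */
static void add_pairs_ps(const double *i0, const double *i1, double *out)
{
    /* Load 128 bits as four floats, then reinterpret them as two doubles. */
    __m128d x1  = _mm_castps_pd(_mm_load_ps((const float *)i0));
    __m128d x2  = _mm_castps_pd(_mm_load_ps((const float *)i1));
    __m128d sum = _mm_add_pd(x1, x2);  /* double-precision lane-wise add */
    _mm_store_pd(out, sum);
}
```

Whether the shorter encoding gives a measurable gain depends on the compiler and processor, so it is worth checking the generated assembly before relying on it.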