Sorry I don’t have a good title… I was reading this thread: Vector Matrix

Question

Editorial Team

Asked: 2026-05-27T05:39:40+00:00

Sorry I don’t have a good title…

I was reading this thread: Vector Matrix Multiplication In SSE

The original poster had the following code

// xmm0 = (v0,v1,v2,v3)
movups xmm0, [eax]

// xmm0 = (v0,v0,v0,v0)
// xmm1 = (v1,v1,v1,v1)
// xmm2 = (v2,v2,v2,v2)
// xmm3 = (v3,v3,v3,v3)
shufps xmm3, xmm0, 255
shufps xmm2, xmm0, 170
shufps xmm1, xmm0, 85
shufps xmm0, xmm0, 0

Someone replied with the following:

But what really happens according to the manual: (a, b, c, d) means a are bits 0 to 31, b are bits 32 to 63 and so on

// xmm0 = (v0,v1,v2,v3)
movups xmm0, [eax]

// xmm0 = (v0, v0, v0, v0)
shufps xmm0, xmm0, 0

This makes sense to me, since in a linear array model [elt0, elt1, elt2, ...] elt0 is Array[0].

What confuses me is that, according to the manual, the bit layout of an xmm register is [127…0] (see the picture below).

Like the original poster, I looked at that layout and assumed that the leftmost element of [elt0, elt1, elt2, elt3] is the one selected by the bits “11”.

So if I want xmm0 to contain only v0, I would write

shufps xmm0, xmm0, 0xFF  // 11 11 11 11  === 0xFF

Which explanation is correct?

[Image: bit layout of an xmm register, bits 127…0]

1 Answer

Editorial Team · Answer 1 · 2026-05-27T05:39:41+00:00

There may be some confusion because bits in xmm registers (and in all other registers, by the way) are numbered right-to-left, i.e. the lowest bit is written on the right and the highest bit on the left:

xmm0 = [bit 127, bit 126, ..., bit 1, bit 0]

If you consider the content of an xmm register as four 32-bit dwords, they are also arranged right-to-left:

xmm0 = [dword 3, dword 2, dword 1, dword 0]

The source of this confusion is that if you have an array in memory

float A[4] = { 0.0f, 1.0f, 2.0f, 3.0f };

and you load this array into an xmm register, the elements appear in reversed order when the register is written in this high-to-low notation (A[0] goes to dword 0, the lowest 32 bits):

; xmm0 = (A3 = 3.0f, A2 = 2.0f, A1 = 1.0f, A0 = 0.0f) after the load
movups xmm0, [A]
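The same mapping can be checked from C with SSE intrinsics. This is only a minimal sketch: the helper names are made up for illustration, and it assumes an x86 compiler that provides xmmintrin.h. _mm_set_ps takes its arguments from the highest dword down to the lowest, so it matches the (A3, A2, A1, A0) notation used above.

```c
#include <string.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: returns 1 if loading {0,1,2,3} from memory gives the
   same register contents as _mm_set_ps(3,2,1,0), whose arguments are listed
   from dword 3 (highest) down to dword 0 (lowest). */
int load_matches_high_to_low_notation(void)
{
    float A[4] = { 0.0f, 1.0f, 2.0f, 3.0f };
    __m128 loaded  = _mm_loadu_ps(A);                     /* like movups xmm0, [A] */
    __m128 written = _mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f);  /* (dword 3, ..., dword 0) */
    return memcmp(&loaded, &written, sizeof loaded) == 0;
}

/* Hypothetical helper: returns the low dword (bits 0..31) of a register
   loaded from the four floats at p. */
float low_dword_after_load(const float *p)
{
    return _mm_cvtss_f32(_mm_loadu_ps(p));
}
```

So the first array element always ends up in the low dword; only the way the register is drawn on paper is reversed.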

Therefore, the right way to copy the first dword into all dwords in an xmm register is

shufps xmm0, xmm0, 0
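To see what the two immediates from the question actually do, here is a small sketch using the _mm_shuffle_ps intrinsic, which corresponds to shufps. The function names are hypothetical. Each 2-bit field of the immediate is a dword index, so 0x00 selects dword 0 four times and 0xFF selects dword 3 four times.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: copies dword 0 of the loaded vector to all four
   dwords, like "shufps xmm0, xmm0, 0". */
void broadcast_dword0(const float *in, float *out)
{
    __m128 v = _mm_loadu_ps(in);
    _mm_storeu_ps(out, _mm_shuffle_ps(v, v, 0x00));  /* selectors 00 00 00 00 */
}

/* Hypothetical helper: copies dword 3 of the loaded vector to all four
   dwords, like "shufps xmm0, xmm0, 0xFF". */
void broadcast_dword3(const float *in, float *out)
{
    __m128 v = _mm_loadu_ps(in);
    _mm_storeu_ps(out, _mm_shuffle_ps(v, v, 0xFF));  /* selectors 11 11 11 11 */
}
```

With in = {v0, v1, v2, v3} in memory, broadcast_dword0 fills out with v0 and broadcast_dword3 fills it with v3.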

Also, if you want to do load-and-broadcast of a single float into all elements of an xmm register, for performance reasons it is better to use

; MOVSS can be much faster than MOVUPS, and is never slower
; Load A[0] into low dword of xmm0
movss xmm0, [A]
; Copy low dword of xmm0 to all dwords of xmm0
shufps xmm0, xmm0, 0
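In C the same two-step sequence could be written roughly as below. This is a sketch with hypothetical function names; _mm_load_ss corresponds to movss with a memory operand (it also zeroes the upper three dwords), and _mm_shuffle_ps corresponds to shufps. A compiler is free to pick a different instruction sequence for it.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: loads one float into the low dword only; the upper
   three dwords are zeroed by the load. Result is stored to out[0..3]. */
void load_low_only(const float *p, float *out)
{
    _mm_storeu_ps(out, _mm_load_ss(p));  /* like movss xmm0, [p] */
}

/* Hypothetical helper: loads one float and copies it to all four dwords. */
void load_and_broadcast(const float *p, float *out)
{
    __m128 v = _mm_load_ss(p);       /* like movss  xmm0, [p]     */
    v = _mm_shuffle_ps(v, v, 0);     /* like shufps xmm0, xmm0, 0 */
    _mm_storeu_ps(out, v);
}
```

Note that only 4 bytes are read from memory here, so p does not need to point at a full 4-element array.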

The AVX instruction set (supported starting with the Intel Sandy Bridge and AMD Bulldozer CPUs) has a special instruction, vbroadcastss, which performs load-and-broadcast:

; xmm0 = (A[0], A[0], A[0], A[0]) after execution of vbroadcastss
vbroadcastss xmm0, [A]

The SSE3 instruction set includes a similar instruction, MOVDDUP, which, however, only works for doubles:

const double B = 2.718281828459045;

; xmm0 = ( 2.718281828459045, 2.718281828459045 ) after execution of movddup
movddup xmm0, [B]
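For completeness, a double can also be broadcast from C without requiring SSE3 at compile time. The sketch below uses the SSE2 intrinsic _mm_load1_pd, which produces the same register contents as movddup with a memory operand. Whether the compiler actually emits movddup for it depends on the compiler and the target flags, so treat that part as an assumption. The function name is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: loads one double and copies it to both halves of an
   xmm register, then stores the two doubles to out[0..1]. */
void load_and_broadcast_double(const double *p, double *out)
{
    __m128d v = _mm_load1_pd(p);  /* (*p, *p) */
    _mm_storeu_pd(out, v);
}
```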
