Sorry I don’t have a good title…
I was reading this thread: Vector Matrix Multiplication In SSE
The original poster had the following code:
// xmm0 = (v0,v1,v2,v3)
movups xmm0, [eax]
// xmm0 = (v0,v0,v0,v0)
// xmm1 = (v1,v1,v1,v1)
// xmm2 = (v2,v2,v2,v2)
// xmm3 = (v3,v3,v3,v3)
shufps xmm3, xmm0, 255
shufps xmm2, xmm0, 170
shufps xmm1, xmm0, 85
shufps xmm0, xmm0, 0
Someone replied with the following:
But what really happens, according to the manual, is this: (a, b, c, d) means a is bits 0 to 31, b is bits 32 to 63, and so on.
// xmm0 = (v0,v1,v2,v3)
movups xmm0, [eax]
// xmm0 = (v0, v0, v0, v0)
shufps xmm0, xmm0, 0
This makes sense to me, since in a linear array model [elt0, elt1, elt2, …], elt0 is Array[0].
What confuses me is that, according to the manual, the bit layout of an xmm register is drawn as [127…0] (see the picture below).
Like the original poster, I looked at that layout and assumed the leftmost element of [elt0, elt1, elt2, elt3] was the one selected by the bits “11”.
So if I wanted xmm0 to contain only v0, I would write:
shufps xmm0, xmm0, 0xFF // 11 11 11 11 === 0xFF
Which explanation is correct?

There may be some confusion because bits in xmm registers (and all other registers, BTW) are numbered right-to-left in the manual's diagrams, i.e. the lowest bit (bit 0) is on the right and the highest bit (bit 127) is on the left.
If you consider the content of an xmm register as 32-bit dwords, they are also arranged right-to-left: dword 0 (bits 0 to 31) is on the right and dword 3 (bits 96 to 127) is on the left.
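To make the numbering concrete, here is a small C sketch (my own illustration, not something from the manual). It models the 128-bit register as two 64-bit halves, which is an assumption made purely for the example, and dword_at is a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model only: lo holds bits 0..63 of the register,
   hi holds bits 64..127. Returns 32-bit dword number i (0..3). */
static uint32_t dword_at(uint64_t lo, uint64_t hi, unsigned i)
{
    assert(i < 4);                       /* only dwords 0..3 exist */
    uint64_t half = (i < 2) ? lo : hi;   /* dwords 0,1 live in lo; 2,3 in hi */
    unsigned shift = 32u * (i & 1u);     /* 0 for even dwords, 32 for odd ones */
    return (uint32_t)(half >> shift);
}
```

With lo = 0x2222222211111111 and hi = 0x4444444433333333, dword_at(lo, hi, 0) gives 0x11111111 (the rightmost dword in the diagram) and dword_at(lo, hi, 3) gives 0x44444444 (the leftmost one).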
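The source of this confusion is that if you have an array in memory and you load it into an xmm register, the elements appear in the register diagram in reversed order: the first array element lands in dword 0 (bits 0 to 31), which is drawn on the right, and the last element lands in dword 3, which is drawn on the left.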
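The reversal is only in how the register is drawn, not in the data itself. A rough C sketch of that (format_register_diagram is a hypothetical helper I made up for illustration) prints a four-float array the way a manual-style diagram would show it after the load, highest element first:

```c
#include <assert.h>
#include <stdio.h>

/* Writes the four elements of mem in diagram order:
   mem[3] (bits 127..96) first, mem[0] (bits 31..0) last.
   Returns what snprintf returns (the length of the formatted text). */
static int format_register_diagram(const float mem[4], char *out, size_t out_size)
{
    assert(mem != NULL && out != NULL);
    return snprintf(out, out_size, "[%g, %g, %g, %g]",
                    (double)mem[3], (double)mem[2],
                    (double)mem[1], (double)mem[0]);
}
```

For an array {1, 2, 3, 4} this produces "[4, 3, 2, 1]". Element 0 of the array is still element 0 of the register; it just sits at the right-hand end of the picture.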
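Therefore, the right way to copy the first dword into all dwords of an xmm register is shufps xmm0, xmm0, 0 (selector 00 in every position), not 0xFF, which would copy the last dword (v3) instead.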
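If it helps, the selection rule can be written out as a plain C model of shufps. This is a sketch of the instruction's documented behaviour, not real SSE code; xmm_model and shufps_model are names I invented for the example, and e[0] stands for bits 0 to 31:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for an xmm register: e[0] is bits 0..31,
   e[3] is bits 96..127. */
typedef struct { float e[4]; } xmm_model;

/* Model of "shufps dst, src, imm": each 2-bit field of imm picks an
   element index. The two low result elements are picked from dst,
   the two high result elements are picked from src. */
static xmm_model shufps_model(xmm_model dst, xmm_model src, uint8_t imm)
{
    xmm_model r;
    r.e[0] = dst.e[imm & 3];          /* imm bits 1:0 */
    r.e[1] = dst.e[(imm >> 2) & 3];   /* imm bits 3:2 */
    r.e[2] = src.e[(imm >> 4) & 3];   /* imm bits 5:4 */
    r.e[3] = src.e[(imm >> 6) & 3];   /* imm bits 7:6 */
    return r;
}
```

With v = (v0, v1, v2, v3) used as both operands, shufps_model(v, v, 0) yields (v0, v0, v0, v0), while shufps_model(v, v, 0xFF) yields (v3, v3, v3, v3). Note that the two low result elements always come from the destination, so a form like shufps xmm3, xmm0, 255 only puts v3 into the upper two elements of xmm3 and takes the lower two from whatever xmm3 held before.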
Also, if you want to do load-and-broadcast of a single float into all elements of an xmm register, for performance reasons it is better to use
The AVX instruction set (supported in the recent Intel Sandy Bridge and AMD Bulldozer CPUs) has a dedicated instruction, vbroadcastss, which performs load-and-broadcast in one step, for example vbroadcastss xmm0, [eax].
The SSE3 instruction set includes a similar instruction, MOVDDUP, which, however, only works for doubles.