The SSE shift instructions I have found can only shift by the same amount on all the elements:
_mm_sll_epi32()_mm_slli_epi32()
These shift all elements, but by the same shift amount.
Is there a way to apply different shifts to the different elements? Something like this:
__m128i a, __m128i b;
r0:= a0 << b0;
r1:= a1 << b1;
r2:= a2 << b2;
r3:= a3 << b3;
There exists the
_mm_shl_epi32()intrinsic that does exactly that.http://msdn.microsoft.com/en-us/library/gg445138.aspx
However, it requires the XOP instruction set. Only AMD Bulldozer and Interlagos processors or later have this instruction. It is not available on any Intel processor.
If you want to do it without XOP instructions, you will need to do it the hard way: Pull them out and do them one by one.
Without XOP instructions, you can do this with SSE4.1 using the following intrinsics:
_mm_insert_epi32()_mm_extract_epi32()http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_sse41_reg_ins_ext.htm
Those will let you extract parts of a 128-bit register into regular registers to do the shift and put them back.
If you go with the latter method, it’ll be horrifically inefficient. That’s why
_mm_shl_epi32()exists in the first place.