I have a simple problem. Having a starting uint_32 value (say 125) and a

Question

Asked: June 13, 2026 (2026-06-13T14:33:06+00:00)

I have a simple problem. I have a starting uint32_t value (say 125) and an __m128i of operands to add, for example (+5, +10, -1, -5). What I would like to get as fast as possible is the vector (125 + 5, 125 + 5 + 10, 125 + 5 + 10 - 1, 125 + 5 + 10 - 1 - 5), i.e. the operands added cumulatively to the starting value. So far the only solution I can think of is adding up four __m128i variables. For the example, they would be:

/* pseudo-SSE code; vectors are written lowest lane first */
__m128i src      = (125,125,125,125)
__m128i operands = (5,10,-1,-5)

/* The partitioning of operands into add1..add4 is omitted for brevity */

__m128i add1 = (+05,+05,+05,+05)
__m128i add2 = (+00,+10,+10,+10)
__m128i add3 = (+00,+00,-01,-01)
__m128i add4 = (+00,+00,+00,-05)

/* There is no _mm_add_epu32; _mm_add_epi32 covers signed and unsigned lanes */
__m128i res1 = _mm_add_epi32( add1, add2 )
__m128i res2 = _mm_add_epi32( add3, add4 )
__m128i res3 = _mm_add_epi32( res1, res2 )   /* combine both partial sums */
__m128i res  = _mm_add_epi32( res3, src  )

Like this, I get what I wanted. For this solution I need to set up all the add_ variables and then perform 4 additions. What I'm really asking is whether this can be done faster, either with a different algorithm or with some specialized SSE function that I do not know yet (something like _mm_cumulative_sum()). Many thanks.
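One commonly used alternative to building four partial vectors is a two-step shift-and-add, sketched below under the assumption that SSE2 is available. Each step adds a copy of the vector shifted up by one lane, then by two lanes, so that lane k ends up holding the sum of lanes 0..k. The names `cumulative_add` and `cumulative_add_array` are made up for this illustration and are not part of any library.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: running sums of four 32-bit lanes, offset by `start`.
   Lane 0 is the lowest element of the vector (first in memory). */
__m128i cumulative_add(uint32_t start, __m128i operands)
{
    /* Step 1: add a copy shifted up by one lane (4 bytes); zeros shift in.
       (x0, x1, x2, x3) becomes (x0, x0+x1, x1+x2, x2+x3). */
    __m128i sum = _mm_add_epi32(operands, _mm_slli_si128(operands, 4));

    /* Step 2: add a copy shifted up by two lanes (8 bytes).
       Lane k now holds x0 + ... + xk. */
    sum = _mm_add_epi32(sum, _mm_slli_si128(sum, 8));

    /* Step 3: add the broadcast starting value to every lane. */
    return _mm_add_epi32(sum, _mm_set1_epi32((int)start));
}

/* Hypothetical convenience wrapper that works on plain arrays. */
void cumulative_add_array(uint32_t start, const int32_t in[4], int32_t out[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, cumulative_add(start, v));
}
```

Calling `cumulative_add_array(125, in, out)` with `in = {5, 10, -1, -5}` fills `out` with 130, 140, 139, 134. Whether this is actually faster than scalar code depends on how the operands arrive; if they have to be assembled from scalars first, the setup cost may outweigh the saved additions.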

1 Answer

Editorial Team · Answer 1 · 2026-06-13T14:33:07+00:00

Thank you all for your help. Determined to figure out which version is the fastest, I wrote a test app.

1/ The non-SSE version does everything just as you would expect.

int iRep;
int iCycle;
int iVal = 25;
int a1, a2, a3, a4;
int dst1 [4];
for ( iCycle = 0; iCycle < CYCLE_COUNT; iCycle++ )
    for ( iRep = 0; iRep < REP_COUNT; iRep++ )
        {   
          a1 = a2 = a3 = a4 = iRep;
          dst1[0] = iVal + a1;
          dst1[1] = dst1[0] + a2;
          dst1[2] = dst1[1] + a3;
          dst1[3] = dst1[2] + a4;
        }

2/ The SSE version with 4 additions does what I proposed, i.e.

__m128i _a1, _a2, _a3, _a4;
__m128i _res1, _res2, _res3;
__m128i _val;
__m128i _res;

for ( iCycle = 0; iCycle < CYCLE_COUNT; iCycle++ )
    for ( iRep = 0; iRep < REP_COUNT; iRep++ ){ 
        a1 = a2 = a3 = a4 = iRep;

        _val = _mm_set1_epi32( iVal );
        _a1  = _mm_set_epi32 (a1, a1, a1, a1 );
        _a2  = _mm_set_epi32 (a2, a2, a2, 0  );
        _a3  = _mm_set_epi32 (a3, a3, 0 , 0  );
        _a4  = _mm_set_epi32 (a4, 0 , 0 , 0  );

        _res1 = _mm_add_epi32( _a1, _a2     );
        _res2 = _mm_add_epi32( _a3, _a4     );
        _res3 = _mm_add_epi32( _val, _res1  );
        _res  = _mm_add_epi32( _res3, _res2 );
    }
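Because `_mm_set_epi32` takes its arguments from the highest lane down to the lowest, it is easy to lose track of where each operand lands. The sketch below wraps the same four-addition scheme in a function and stores the result to an array so the lane order can be inspected. The function name `cumsum_four_adds` is hypothetical and exists only for this illustration.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical wrapper around the four-addition scheme shown above. */
void cumsum_four_adds(int val, int a1, int a2, int a3, int a4, int out[4])
{
    /* _mm_set_epi32(e3, e2, e1, e0): the LAST argument goes to lane 0. */
    __m128i v1 = _mm_set_epi32(a1, a1, a1, a1);  /* a1 in lanes 0..3 */
    __m128i v2 = _mm_set_epi32(a2, a2, a2, 0);   /* a2 in lanes 1..3 */
    __m128i v3 = _mm_set_epi32(a3, a3, 0, 0);    /* a3 in lanes 2..3 */
    __m128i v4 = _mm_set_epi32(a4, 0, 0, 0);     /* a4 in lane 3 only */

    __m128i r1 = _mm_add_epi32(v1, v2);
    __m128i r2 = _mm_add_epi32(v3, v4);
    __m128i r3 = _mm_add_epi32(_mm_set1_epi32(val), r1);

    /* out[0] is lane 0, i.e. val + a1; out[3] is the full sum. */
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(r3, r2));
}
```

With `val = 125` and operands 5, 10, -1, -5 this stores 130, 140, 139, 134 in `out[0]` through `out[3]`.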

3/ The SSE version with 3 additions does what Evgeny proposed, i.e.

__m128i shift1, shift2, operands ;
for ( iCycle = 0; iCycle < CYCLE_COUNT; iCycle++ )
    for ( iRep = 0; iRep < REP_COUNT; iRep++ ){ 
        a1 = a2 = a3 = a4 = iRep;

        _val = _mm_set1_epi32( iVal );
        operands = _mm_set_epi32(a1,a2,a3,a4);

        shift1 = _mm_add_epi32( operands,
                 _mm_and_si128( _mm_shuffle_epi32( operands, 0xF9 ),
                                _mm_set_epi32( 0, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF ) ) );
        shift2 = _mm_add_epi32( shift1,
                 _mm_and_si128( _mm_shuffle_epi32( shift1, 0xFE ),
                                _mm_set_epi32( 0, 0, 0xFFFFFFFF, 0xFFFFFFFF ) ) );
        _res = _mm_add_epi32( _val, shift2 );
        }
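In the test loop all four operands are equal, so lane order does not show up in the result. With distinct operands it does: since `_mm_set_epi32(a1,a2,a3,a4)` places `a1` in the highest lane, the shuffle-and-mask steps accumulate downward and the running sums come out in reverse memory order compared with variant 2. The sketch below makes this visible; `cumsum_shuffle` is a hypothetical name, and the masks are written as -1 instead of 0xFFFFFFFF purely for brevity.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical wrapper around the three-addition scheme shown above. */
void cumsum_shuffle(int val, int a1, int a2, int a3, int a4, int out[4])
{
    /* a1 lands in lane 3, a4 in lane 0. */
    __m128i operands = _mm_set_epi32(a1, a2, a3, a4);

    /* 0xF9 moves every lane down by one; the mask clears lane 3. */
    __m128i shift1 = _mm_add_epi32(operands,
        _mm_and_si128(_mm_shuffle_epi32(operands, 0xF9),
                      _mm_set_epi32(0, -1, -1, -1)));

    /* 0xFE moves every lane down by two; the mask clears lanes 2 and 3. */
    __m128i shift2 = _mm_add_epi32(shift1,
        _mm_and_si128(_mm_shuffle_epi32(shift1, 0xFE),
                      _mm_set_epi32(0, 0, -1, -1)));

    /* out[3] holds val + a1; out[0] holds the full sum. */
    _mm_storeu_si128((__m128i *)out,
                     _mm_add_epi32(_mm_set1_epi32(val), shift2));
}
```

With `val = 125` and operands 5, 10, -1, -5 this stores 134, 139, 140, 130 in `out[0]` through `out[3]`, so the first running sum sits in the highest lane.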

The results for

 #define REP_COUNT    100000
 #define CYCLE_COUNT  100000

are

  non-SSE        -> 6.118s
  SSE-4additions -> 20.775s
  SSE-3additions -> 14.873s

Rather surprising…
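None of the three loops above reads `dst1` or `_res` after computing it, and the SSE variants rebuild their vectors from scalars on every iteration, so the timings may say as much about the test setup as about the additions themselves. The following is only a sketch of one way to make such a comparison a little more robust, assuming SSE2: each variant folds its results into a checksum that the caller can read, and the constant vector is built once outside the loop. The names `run_scalar` and `run_sse` are hypothetical, and no timings are claimed for this code.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Scalar variant: folds all four running sums into a checksum. */
uint32_t run_scalar(uint32_t val, uint32_t reps)
{
    uint32_t checksum = 0;
    for (uint32_t i = 0; i < reps; i++) {
        uint32_t d0 = val + i;
        uint32_t d1 = d0 + i;
        uint32_t d2 = d1 + i;
        uint32_t d3 = d2 + i;
        checksum += d0 + d1 + d2 + d3;  /* unsigned wrap-around is well defined */
    }
    return checksum;
}

/* SSE2 variant: same arithmetic, accumulated per lane and folded at the end. */
uint32_t run_sse(uint32_t val, uint32_t reps)
{
    const __m128i base = _mm_set1_epi32((int)val);  /* built once, not per iteration */
    __m128i acc = _mm_setzero_si128();
    for (uint32_t i = 0; i < reps; i++) {
        __m128i ops = _mm_set1_epi32((int)i);
        __m128i sum = _mm_add_epi32(ops, _mm_slli_si128(ops, 4));
        sum = _mm_add_epi32(sum, _mm_slli_si128(sum, 8));
        acc = _mm_add_epi32(acc, _mm_add_epi32(base, sum));
    }
    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Both functions return the same checksum for the same inputs, which doubles as a correctness check before timing either one. A compiler may still simplify or vectorize the scalar loop on its own, so it is worth inspecting the generated assembly before drawing conclusions.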
