I have a simple problem. I have a starting uint32_t value (say 125) and a __m128i of operands to add, for example (+5, +10, -1, -5). What I would like to get as fast as possible is the vector (125 + 5, 125 + 5 + 10, 125 + 5 + 10 - 1, 125 + 5 + 10 - 1 - 5), i.e. the running (cumulative) sum of the operands added to the starting value. So far the only solution I can think of is adding up four __m128i variables. For the example, they would be
/* SSE2 sketch; _mm_setr_epi32 places its first argument in the lowest lane */
__m128i src      = _mm_set1_epi32(125);
__m128i operands = _mm_setr_epi32(5, 10, -1, -5);
/* Here I omit the partitioning of operands into add1..add4 for brevity */
__m128i add1 = _mm_setr_epi32(5,  5,  5,  5);
__m128i add2 = _mm_setr_epi32(0, 10, 10, 10);
__m128i add3 = _mm_setr_epi32(0,  0, -1, -1);
__m128i add4 = _mm_setr_epi32(0,  0,  0, -5);
__m128i res1 = _mm_add_epi32(add1, add2);
__m128i res2 = _mm_add_epi32(add3, add4);
__m128i res3 = _mm_add_epi32(res1, res2);  /* combine both partial sums */
__m128i res  = _mm_add_epi32(res3, src);
Like this, I get what I wanted. This solution requires setting up all the add_ variables and then performing 4 additions. What I'm really asking is whether this can be done faster, either via a different algorithm or by using some specialized SSE function that I do not know yet (something like _mm_cumulative_sum()). Many thanks.
Thank you all for your help. To figure out which version is the fastest, I wrote a test app.
1/ The non-SSE version does all the work just as you would expect.
2/ The SSE version with 4 additions does what I proposed, i.e.
3/ The SSE version with 3 additions does what Evgeny proposed, i.e.
The results for
are
Rather surprising…