I have an inline assembler loop that cumulatively adds elements from an int32 data array with MMX instructions. In particular, it uses the fact that the eight MMX registers can together hold 16 int32s to calculate 16 different cumulative sums in parallel.
I would now like to convert this piece of code to MMX intrinsics, but I am afraid that I will suffer a performance penalty because one cannot explicitly instruct the compiler to use all 8 MMX registers to accumulate the 16 independent sums.
Can anybody comment on this and maybe propose a solution on how to convert the piece of code below to use intrinsics?
== inline assembler (only part within the loop) ==
paddd mm0, [esi+edx+8*0] ; add 1st & 2nd int32 elements (one pair per register)
paddd mm1, [esi+edx+8*1] ; add 3rd & 4th int32 elements ...
paddd mm2, [esi+edx+8*2]
paddd mm3, [esi+edx+8*3]
paddd mm4, [esi+edx+8*4]
paddd mm5, [esi+edx+8*5]
paddd mm6, [esi+edx+8*6]
paddd mm7, [esi+edx+8*7] ; add 15th & 16th int32 elements
- esi points to the beginning of the data array
- edx provides the offset in the data array for the current loop iteration
- the data array is arranged such that the elements for the 16 independent sums are interleaved.
VS2010 does a decent optimization job on the equivalent code using intrinsics. In most cases, it compiles the intrinsic:
into something like:
This isn’t as concise as your paddd mm1, [esi+edx+8*offset], but it arguably comes pretty close. The execution time is likely dominated by the memory fetch.
The catch is that VS seems to like adding MMX registers only to other MMX registers, so one of the eight registers is needed to hold the value loaded from memory. The above scheme therefore works only for the first 7 sums. The 8th sum requires that some register be saved temporarily to memory.
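As a point of reference, here is a minimal sketch of an eight-accumulator loop written with MMX intrinsics. It is not the listing discussed in this answer: the function name sum16_mmx, the rows parameter, and the use of memcpy for the loads are assumptions made for illustration. Whether the eight accumulators actually stay in mm0-mm7 is up to the compiler, which is exactly the concern raised in the question.

```c
#include <mmintrin.h>  /* MMX intrinsics: __m64, _mm_add_pi32, _mm_empty */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (name and signature are assumptions).
 * data holds `rows` groups of 16 interleaved int32 values;
 * sums receives the 16 independent totals. */
void sum16_mmx(const int32_t *data, size_t rows, int32_t sums[16])
{
    __m64 acc[8];
    size_t r;
    int k;

    for (k = 0; k < 8; ++k)
        acc[k] = _mm_setzero_si64();          /* each accumulator holds two int32 sums */

    for (r = 0; r < rows; ++r) {
        const int32_t *row = data + 16 * r;   /* start of the current group of 16 */
        for (k = 0; k < 8; ++k) {
            __m64 pair;
            memcpy(&pair, row + 2 * k, sizeof pair);  /* load one pair of int32s */
            acc[k] = _mm_add_pi32(acc[k], pair);      /* packed 32-bit add (paddd) */
        }
    }

    for (k = 0; k < 8; ++k)
        memcpy(sums + 2 * k, &acc[k], sizeof acc[k]); /* write both lanes back */

    _mm_empty();  /* leave MMX state so later x87 floating-point code works */
}
```

Inspecting the generated assembly for a loop like this is the only reliable way to see how many accumulators the compiler keeps in registers.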
Here’s a complete program and its corresponding compiled assembly (release build):
Here’s the generated assembly code for the inner portion of the sum loop (without its prolog and epilog). Note how most sums are kept efficiently in mm1-mm6. Contrast that with mm0, which is used to bring in the number to add to each sum, and with mm7, which serves the last two sums. The 7-sum version of this program doesn’t seem to have the mm7 problem.
So what can you do?
Profile the 7-sum and 8-sum intrinsic-based programs. Choose the one that executes quicker.
Profile the version that adds just one MMX register at a time. It should still be able to take advantage of the fact that modern processors fetch 64 to 128 bytes into the cache at a time. It is not obvious that the 8-sum version would be faster than the 1-sum one. The 1-sum version fetches the exact same amount of memory, and does the exact same number of MMX additions. You will need to interleave the inputs accordingly though.
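If your target hardware allows it, consider using SSE2 instructions. Those can add four 32-bit integers at a time. SSE2 has been available in Intel CPUs since the Pentium 4 in 2000. (The original SSE, available since the Pentium III in 1999, provides only floating-point operations on its 128-bit registers.)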
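The last suggestion can be sketched as follows, assuming the same interleaved layout of 16 int32 values per group. The function name sum16_sse2 and the rows parameter are hypothetical. Four 128-bit accumulators cover all 16 sums, so register pressure is much lower than in the MMX version. Unaligned loads are used because nothing here guarantees 16-byte alignment of the data.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: __m128i, _mm_add_epi32 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (name and signature are assumptions).
 * data holds `rows` groups of 16 interleaved int32 values;
 * sums receives the 16 independent totals. */
void sum16_sse2(const int32_t *data, size_t rows, int32_t sums[16])
{
    /* Four accumulators, each holding four int32 sums. */
    __m128i acc0 = _mm_setzero_si128();
    __m128i acc1 = _mm_setzero_si128();
    __m128i acc2 = _mm_setzero_si128();
    __m128i acc3 = _mm_setzero_si128();
    size_t r;

    for (r = 0; r < rows; ++r) {
        const __m128i *row = (const __m128i *)(data + 16 * r);
        /* _mm_loadu_si128 tolerates data that is not 16-byte aligned. */
        acc0 = _mm_add_epi32(acc0, _mm_loadu_si128(row + 0));
        acc1 = _mm_add_epi32(acc1, _mm_loadu_si128(row + 1));
        acc2 = _mm_add_epi32(acc2, _mm_loadu_si128(row + 2));
        acc3 = _mm_add_epi32(acc3, _mm_loadu_si128(row + 3));
    }

    /* Store the 16 totals in their original interleaved order. */
    _mm_storeu_si128((__m128i *)(sums + 0),  acc0);
    _mm_storeu_si128((__m128i *)(sums + 4),  acc1);
    _mm_storeu_si128((__m128i *)(sums + 8),  acc2);
    _mm_storeu_si128((__m128i *)(sums + 12), acc3);
}
```

As with the MMX variants, it is worth profiling this against the existing inline assembler before committing to it, since the loop is likely to be limited by memory fetches either way.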