I’ve been struggling for a while with the performance of the network coding in an application I’m developing (see Optimzing SSE-code, Improving performance of network coding-encoding and OpenCL distribution). Now I’m quite close to achieving acceptable performance. This is the current state of the innermost loop (which is where >99% of the execution time is spent):
while (elementIterations-- > 0)
{
    unsigned int firstMessageField = *(currentMessageGaloisFieldsArray++);
    unsigned int secondMessageField = *(currentMessageGaloisFieldsArray++);
    __m128i valuesToMultiply = _mm_set_epi32(0, secondMessageField, 0, firstMessageField);
    __m128i mulitpliedHalves = _mm_mul_epu32(valuesToMultiply, fragmentCoefficentVector);
}
Do you have any suggestions on how to further optimize this? I understand that it’s hard to do without more context but any help is appreciated!
Now that I’m awake, here’s my answer:
In your original code, the bottleneck is almost certainly _mm_set_epi32. This single intrinsic gets compiled into this mess in your assembly:
What is this? 9 instructions?!?!?! Pure overhead…
Another place that seems odd is that the compiler didn’t merge the adds and loads:
should have been merged into:
I’m not sure if the compiler went brain-dead, or if it actually had a legitimate reason to do that… Anyways, it’s a small thing compared to the _mm_set_epi32.
Disclaimer: The code I will present from here on violates strict-aliasing. But non-standard compliant methods are often needed to achieve maximum performance.
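To make the disclaimer a bit more concrete, here is a minimal sketch of the kind of type-punned load it refers to, next to a standard-conforming alternative. The function name load_pair and the idea of reading two adjacent 32-bit fields as one 64-bit value are assumptions made for illustration; they are not taken from the question's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read two adjacent 32-bit fields as one 64-bit value.
 *
 * The aliasing-violating way would be:
 *     uint64_t pair = *(const uint64_t *)fields;
 * which accesses uint32_t objects through a uint64_t lvalue.
 *
 * The version below copies the bytes instead, which is well defined. */
uint64_t load_pair(const uint32_t *fields)
{
    uint64_t pair;
    assert(fields != NULL);
    memcpy(&pair, fields, sizeof pair); /* copies exactly 8 bytes */
    return pair;
}
```

Optimizing compilers usually turn a fixed-size memcpy like this into a single load, so the conforming version often costs nothing, but it is worth checking the generated assembly for your compiler and flags.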
Solution 1: No Vectorization
This solution assumes allZero is really all zeros.
The loop is actually simpler than it looks. Since there isn’t a lot of arithmetic, it might be better to just not vectorize:
Which compiles to this on x64:
and this on x86:
It’s possible that both of these are already faster than your original (SSE) code… On x64, unrolling it will make it even better.
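As a rough sketch of the non-vectorized idea, the multiply step can be written with plain 64-bit arithmetic. This covers only the 32x32 to 64-bit multiply that one lane of _mm_mul_epu32 performs, not the rest of the loop, and it assumes the coefficient is available as an ordinary 32-bit value. The names multiply_fields_scalar, coefficient and products are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply each 32-bit field by a 32-bit coefficient and keep the full
 * 64-bit product, which is what one lane of _mm_mul_epu32 computes.
 * All names here are hypothetical. */
void multiply_fields_scalar(const uint32_t *fields, size_t count,
                            uint32_t coefficient, uint64_t *products)
{
    assert(fields != NULL && products != NULL);
    for (size_t i = 0; i < count; ++i) {
        /* Widen first so the product is not truncated to 32 bits. */
        products[i] = (uint64_t)fields[i] * coefficient;
    }
}
```

On x64 a widening multiply like this is typically a single instruction per element, which is why skipping SSE can be competitive when there is so little arithmetic per load.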
Solution 2: SSE2 Integer Shuffle
This solution unrolls the loop to 2 iterations:
which gets compiled to this (x86): (x64 isn’t too different)
Only slightly longer than the non-vectorized version for two iterations. This uses very few registers, so you can further unroll this even on x86.
Explanations:
_mm_set_epi32 (which gets compiled into about 9 instructions in your original code) can be replaced with a single _mm_shuffle_epi32.
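Here is a minimal sketch of that replacement for a single pair of fields. It assumes the two message fields sit next to each other in memory so they can be fetched with one 64-bit load; the function name multiply_pair_shuffle and its parameters are invented for illustration, and this is not the unrolled loop described above.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Multiply fields[0] and fields[1] by the 32-bit coefficients held in
 * lanes 0 and 2 of `coefficients`, giving two 64-bit products.
 * Hypothetical helper, shown for one pair only. */
__m128i multiply_pair_shuffle(const uint32_t *fields, __m128i coefficients)
{
    assert(fields != NULL);
    /* One 64-bit load: lanes (low to high) are [f0, f1, 0, 0]. */
    __m128i packed = _mm_loadl_epi64((const __m128i *)fields);
    /* Reorder to [f0, 0, f1, 0], the same layout that
     * _mm_set_epi32(0, f1, 0, f0) would build. */
    __m128i spread = _mm_shuffle_epi32(packed, _MM_SHUFFLE(3, 1, 2, 0));
    /* _mm_mul_epu32 multiplies lanes 0 and 2 as unsigned 32-bit values. */
    return _mm_mul_epu32(spread, coefficients);
}
```

Since _mm_mul_epu32 ignores lanes 1 and 3, their contents after the shuffle do not affect the result; the shuffle only has to put the second field into lane 2.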