I have noticed that sometimes MSVC 2010 doesn’t reorder SSE instructions at all. I thought I didn’t have to care about instruction order inside my loop since the compiler handles that best, which doesn’t seem to be the case.
How should I think about this? What determines the best instruction order? I know some instruction have higher latency than others and that some instructions can run in parallel/async on cpu level. What metrics are relevant in the context? Where can I find them?
I know that I could avoid this question by profiling, however such profilers are expensive (VTune XE) and I would like to know the theory behind it, not just emperical results.
Also should I care about software prefetching (_mm_prefetch) or can I assume that the cpu will do a better job than me?
Lets say I have the following function. Should I interleave some of the instructions? Should I do the stores before the streams, do all the loads in order and then do calculations, etc…? Do I need to consider USWC vs non-USWC, and temporal vs non-temporal?
auto cur128 = reinterpret_cast<__m128i*>(cur);
auto prev128 = reinterpret_cast<const __m128i*>(prev);
auto dest128 = reinterpret_cast<__m128i*>(dest;
auto end = cur128 + count/16;
while(cur128 != end)
{
auto xmm0 = _mm_add_epi8(_mm_load_si128(cur128+0), _mm_load_si128(prev128+0));
auto xmm1 = _mm_add_epi8(_mm_load_si128(cur128+1), _mm_load_si128(prev128+1));
auto xmm2 = _mm_add_epi8(_mm_load_si128(cur128+2), _mm_load_si128(prev128+2));
auto xmm3 = _mm_add_epi8(_mm_load_si128(cur128+3), _mm_load_si128(prev128+3));
// dest128 is USWC memory
_mm_stream_si128(dest128+0, xmm0);
_mm_stream_si128(dest128+1, xmm1);
_mm_stream_si128(dest128+2, xmm2);;
_mm_stream_si128(dest128+3, xmm3);
// cur128 is temporal, and will be used next time, which is why I choose store over stream
_mm_store_si128 (cur128+0, xmm0);
_mm_store_si128 (cur128+1, xmm1);
_mm_store_si128 (cur128+2, xmm2);
_mm_store_si128 (cur128+3, xmm3);
cur128 += 4;
dest128 += 4;
prev128 += 4;
}
std::swap(cur, prev);
I agree with everyone that testing and tweaking is the best approach. But there are some tricks to help it.
First of all, MSVC does re-order SSE instruction. Your example is probably too simple or already optimal.
Generally speaking, if you have enough registers to do so, full interleaving tends to give the best results. To take it a step further, unroll your loops enough to use all the registers, but not too much to spill.
In your example, the loop is completely bound by memory accesses, so there isn’t much room to do any better.
In most cases, it isn’t necessary to get the order of the instructions perfect to achieve optimal performance. As long as it’s “close enough”, either the compiler, or the hardware’s out-of-order execution will fix it for you.
The method I use to determine if my code is optimal is critical-path and bottleneck analysis. After I write the loop, I look up what instructions use which resources. Using that information, I can calculate upper-bound on performance, which I then compare with the actual results to see how close/far I am from optimal.
For example, suppose I have a loop with 100 adds and 50 multiplies. On both Intel and AMD (pre-Bulldozer), each core can sustain one SSE/AVX add and one SSE/AVX multiply per cycle.
Since my loop has 100 adds, I know I cannot do any better than 100 cycles. Yes, the multiplier will be idle half the time, but the adder is the bottleneck.
Now I go and time my loop and I get 105 cycles per iteration. That means I’m pretty close to optimal and there’s not much more to gain. But if I get 250 cycles, then that means something’s wrong with the loop and it’s worth tinkering with it more.
Critical-path analysis follows the same idea. Look up the latencies for all the instructions and find the cycle time of the critical path of the loop. If your actual performance is very close to it, you’re already optimal.
Agner Fog has a great reference for the internal details of the current processors:
http://www.agner.org/optimize/microarchitecture.pdf