The Archive Base

Editorial Team
Asked: May 25, 2026 (2026-05-25T06:24:45+00:00)

I have noticed that sometimes MSVC 2010 doesn’t reorder SSE instructions at all. I thought I didn’t have to care about instruction order inside my loop since the compiler handles that best, which doesn’t seem to be the case.

How should I think about this? What determines the best instruction order? I know some instructions have higher latency than others and that some instructions can run in parallel or asynchronously at the CPU level. What metrics are relevant in this context? Where can I find them?

I know that I could avoid this question by profiling; however, such profilers are expensive (VTune XE), and I would like to know the theory behind it, not just empirical results.

Also, should I care about software prefetching (_mm_prefetch), or can I assume that the CPU will do a better job than me?

Let's say I have the following function. Should I interleave some of the instructions? Should I do the stores before the streams, do all the loads in order and then do the calculations, etc.? Do I need to consider USWC vs. non-USWC, and temporal vs. non-temporal?

            // assumes count is a byte count and a multiple of 64
            // (four 16-byte vectors per iteration), otherwise cur128 never equals end
            auto cur128     = reinterpret_cast<__m128i*>(cur);
            auto prev128    = reinterpret_cast<const __m128i*>(prev);
            auto dest128    = reinterpret_cast<__m128i*>(dest);
            auto end        = cur128 + count/16;

            while(cur128 != end)
            {
                // byte-wise add of the current and previous data, four vectors at a time
                auto xmm0 = _mm_add_epi8(_mm_load_si128(cur128+0), _mm_load_si128(prev128+0));
                auto xmm1 = _mm_add_epi8(_mm_load_si128(cur128+1), _mm_load_si128(prev128+1));
                auto xmm2 = _mm_add_epi8(_mm_load_si128(cur128+2), _mm_load_si128(prev128+2));
                auto xmm3 = _mm_add_epi8(_mm_load_si128(cur128+3), _mm_load_si128(prev128+3));

                                    // dest128 is USWC memory
                _mm_stream_si128(dest128+0, xmm0);
                _mm_stream_si128(dest128+1, xmm1);
                _mm_stream_si128(dest128+2, xmm2);
                _mm_stream_si128(dest128+3, xmm3);

                                    // cur128 is temporal, and will be used next time, which is why I choose store over stream
                _mm_store_si128 (cur128+0, xmm0);
                _mm_store_si128 (cur128+1, xmm1);
                _mm_store_si128 (cur128+2, xmm2);
                _mm_store_si128 (cur128+3, xmm3);

                cur128  += 4;
                dest128 += 4;
                prev128 += 4;
            }

            std::swap(cur, prev);
1 Answer

  1. Editorial Team
    Added an answer on May 25, 2026 at 6:24 am (2026-05-25T06:24:45+00:00)

    I agree with everyone that testing and tweaking is the best approach. But there are some tricks that help.

    First of all, MSVC does reorder SSE instructions. Your example is probably too simple or already optimal.

    Generally speaking, if you have enough registers, full interleaving tends to give the best results. To take it a step further, unroll your loops enough to use all the registers, but not so far that the compiler starts spilling them to the stack.
    In your example, the loop is completely bound by memory accesses, so there isn’t much room to do any better.
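As a rough illustration of what "full interleaving" could look like for the loop in the question, here is a minimal sketch that issues all the loads first, then the adds, then the stores. The function name `add_rows_interleaved` and its byte-pointer signature are hypothetical, and it assumes an x86 target with SSE2 and 16-byte-aligned buffers. Since this loop is memory-bound, it may well run no faster than the original ordering; only measurement can tell.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics (also pulls in _mm_sfence)

// Hypothetical helper: adds `prev` into `cur` byte-wise (wrapping) and writes
// the sums to both `cur` and `dest`. All pointers must be 16-byte aligned and
// `count` (in bytes) must be a multiple of 64.
inline void add_rows_interleaved(std::uint8_t* cur, const std::uint8_t* prev,
                                 std::uint8_t* dest, std::size_t count)
{
    assert(count % 64 == 0);

    auto cur128  = reinterpret_cast<__m128i*>(cur);
    auto prev128 = reinterpret_cast<const __m128i*>(prev);
    auto dest128 = reinterpret_cast<__m128i*>(dest);
    auto end     = cur128 + count / 16;

    while (cur128 != end)
    {
        // Issue all eight independent loads up front (uses 8 XMM registers).
        __m128i c0 = _mm_load_si128(cur128 + 0);
        __m128i c1 = _mm_load_si128(cur128 + 1);
        __m128i c2 = _mm_load_si128(cur128 + 2);
        __m128i c3 = _mm_load_si128(cur128 + 3);
        __m128i p0 = _mm_load_si128(prev128 + 0);
        __m128i p1 = _mm_load_si128(prev128 + 1);
        __m128i p2 = _mm_load_si128(prev128 + 2);
        __m128i p3 = _mm_load_si128(prev128 + 3);

        // Four adds with no dependencies on each other.
        c0 = _mm_add_epi8(c0, p0);
        c1 = _mm_add_epi8(c1, p1);
        c2 = _mm_add_epi8(c2, p2);
        c3 = _mm_add_epi8(c3, p3);

        // Non-temporal stores to dest, regular stores back to cur.
        _mm_stream_si128(dest128 + 0, c0);
        _mm_stream_si128(dest128 + 1, c1);
        _mm_stream_si128(dest128 + 2, c2);
        _mm_stream_si128(dest128 + 3, c3);
        _mm_store_si128(cur128 + 0, c0);
        _mm_store_si128(cur128 + 1, c1);
        _mm_store_si128(cur128 + 2, c2);
        _mm_store_si128(cur128 + 3, c3);

        cur128  += 4;
        prev128 += 4;
        dest128 += 4;
    }

    _mm_sfence();  // order the non-temporal stores before later stores
}
```

Eight live registers fit in both 32-bit (8 XMM registers) and 64-bit (16 XMM registers) builds; unrolling further in a 32-bit build would likely cause spills.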

    In most cases, it isn’t necessary to get the order of the instructions perfect to achieve optimal performance. As long as it’s “close enough”, either the compiler or the hardware’s out-of-order execution will fix it for you.

    The method I use to determine whether my code is optimal is critical-path and bottleneck analysis. After I write the loop, I look up which instructions use which resources. Using that information, I can calculate an upper bound on performance, which I then compare with the actual results to see how far I am from optimal.

    For example, suppose I have a loop with 100 adds and 50 multiplies. On both Intel and AMD (pre-Bulldozer), each core can sustain one SSE/AVX add and one SSE/AVX multiply per cycle.
    Since my loop has 100 adds, I know I cannot do any better than 100 cycles. Yes, the multiplier will be idle half the time, but the adder is the bottleneck.

    Now I go and time my loop and I get 105 cycles per iteration. That means I’m pretty close to optimal and there’s not much more to gain. But if I get 250 cycles, then that means something’s wrong with the loop and it’s worth tinkering with it more.
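The bottleneck arithmetic above can be written down as a small helper. This is only a sketch: the struct `ResourceLoad` and the functions `min_cycles_per_iteration` and `efficiency` are hypothetical names, and the per-unit throughput figures are inputs you have to look up for your own CPU.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One execution resource (e.g. the adder) and the work that a single loop
// iteration places on it. Both numbers are supplied by the caller.
struct ResourceLoad {
    double ops_per_iteration;  // instructions in the loop that need this unit
    double ops_per_cycle;      // sustained throughput of the unit
};

// Lower bound on cycles per iteration: the busiest resource sets the pace.
inline double min_cycles_per_iteration(const std::vector<ResourceLoad>& loads)
{
    double bound = 0.0;
    for (const ResourceLoad& r : loads) {
        assert(r.ops_per_cycle > 0.0);
        bound = std::max(bound, r.ops_per_iteration / r.ops_per_cycle);
    }
    return bound;
}

// Fraction of the theoretical peak that a measured timing achieves.
inline double efficiency(double bound_cycles, double measured_cycles)
{
    assert(measured_cycles > 0.0);
    return bound_cycles / measured_cycles;
}
```

With the figures from the example (100 adds and 50 multiplies, one of each per cycle), the bound is 100 cycles. A measured 105 cycles is then about 95% of peak, while 250 cycles is only 40%.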

    Critical-path analysis follows the same idea. Look up the latencies for all the instructions and find the cycle time of the critical path of the loop. If your actual performance is very close to it, you’re already optimal.
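One way to sketch the critical-path calculation is as a longest-path search over the dependency graph of the loop body. The names `Instr` and `critical_path_cycles` are hypothetical, and the latencies must come from a reference such as the one linked below; nothing here is measured.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One instruction in the loop body. `deps` holds the indices of earlier
// instructions whose results this one consumes.
struct Instr {
    int latency;                    // cycles until the result is available
    std::vector<std::size_t> deps;  // each index must be < this instruction's
};

// Length in cycles of the longest dependency chain. Instructions must be
// listed so that every dependency precedes its user (topological order).
inline int critical_path_cycles(const std::vector<Instr>& body)
{
    std::vector<int> finish(body.size(), 0);
    int longest = 0;
    for (std::size_t i = 0; i < body.size(); ++i) {
        int start = 0;  // earliest cycle at which all inputs are ready
        for (std::size_t d : body[i].deps) {
            assert(d < i);
            start = std::max(start, finish[d]);
        }
        finish[i] = start + body[i].latency;
        longest = std::max(longest, finish[i]);
    }
    return longest;
}
```

For instance, two independent loads with an assumed latency of 4 cycles each, feeding an add of 1 cycle that feeds a store of 1 cycle, give a chain of 4 + 1 + 1 = 6 cycles. This matters most when a chain is carried from one iteration to the next; when iterations are independent, as in the question's loop, out-of-order execution can overlap them and the throughput bound is usually the tighter limit.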

    Agner Fog has a great reference for the internal details of the current processors:
    http://www.agner.org/optimize/microarchitecture.pdf
