Editorial Team
Asked: May 17, 2026

I’m running into inconsistent optimization behavior across compilers with the following code:

class tester
{
public:
    tester(int* arr_, int sz_)
        : arr(arr_), sz(sz_)
    {}

    int doadd()
    {
        sm = 0;
        for (int n = 0; n < 1000; ++n) 
        {
            for (int i = 0; i < sz; ++i)
            {
                sm += arr[i];
            }
        }
        return sm;
    }
protected:
    int* arr;
    int sz;
    int sm;
};

The doadd function simulates some intensive access to members (ignore the overflows in addition for this question). Compared with similar code implemented as a function:

int arradd(int* arr, int sz)
{
    int sm = 0;
    for (int n = 0; n < 1000; ++n) 
    {
        for (int i = 0; i < sz; ++i)
        {
            sm += arr[i];
        }
    }
    return sm;
}

The doadd method runs about 1.5 times slower than the arradd function when compiled in Release mode with Visual C++ 2008. When I modify the doadd method to be as follows (aliasing all members with locals):

int doadd()
{
    int mysm = 0;
    int* myarr = arr;
    int mysz = sz;
    for (int n = 0; n < 1000; ++n) 
    {
        for (int i = 0; i < mysz; ++i)
        {
            mysm += myarr[i];
        }
    }
    sm = mysm;
    return sm;
}

Runtimes become roughly the same. Am I right in concluding that this is a missing optimization by the Visual C++ compiler? g++ seems to do it better and run both the member function and the normal function at the same speed when compiling with -O2 or -O3.


The benchmarking is done by invoking the doadd member and the arradd function on a sufficiently large array (a few million integers in size).
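A minimal sketch of such a benchmark, assuming std::chrono::steady_clock as the timer. The helpers time_call and make_alternating are hypothetical names introduced for illustration, not part of the original code; the array is filled with alternating +1/-1 so the sum stays small and signed overflow does not come into play.

```cpp
#include <cassert>
#include <chrono>
#include <utility>
#include <vector>

// Free-function version from the question.
int arradd(int* arr, int sz)
{
    int sm = 0;
    for (int n = 0; n < 1000; ++n)
        for (int i = 0; i < sz; ++i)
            sm += arr[i];
    return sm;
}

// Member-function version from the question.
class tester
{
public:
    tester(int* arr_, int sz_) : arr(arr_), sz(sz_), sm(0) {}

    int doadd()
    {
        sm = 0;
        for (int n = 0; n < 1000; ++n)
            for (int i = 0; i < sz; ++i)
                sm += arr[i];
        return sm;
    }
protected:
    int* arr;
    int sz;
    int sm;
};

// Hypothetical helper: alternating +1/-1 values keep the total small,
// so the benchmark never overflows a signed int.
std::vector<int> make_alternating(int n)
{
    assert(n >= 0);
    std::vector<int> v;
    for (int i = 0; i < n; ++i)
        v.push_back(i % 2 == 0 ? 1 : -1);
    return v;
}

// Hypothetical helper: runs f once and returns {result, elapsed milliseconds}.
template <typename F>
std::pair<int, double> time_call(F&& f)
{
    auto start = std::chrono::steady_clock::now();
    int result = f();
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::milli> ms = stop - start;
    return {result, ms.count()};
}
```

To compare the two versions, build the array once with make_alternating, wrap each call in a lambda passed to time_call, and print both the result and the elapsed time. Using the result keeps the compiler from discarding the loops entirely.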


EDIT: Some fine-grained testing shows that the main culprit is the sm member. Replacing all others by local versions still makes the runtime long, but once I replace sm by mysm the runtime becomes equal to the function version.


Resolution

Disappointed with the answers (sorry guys), I shook off my laziness and dove into the disassembly listings for this code. My answer below summarizes the findings. In short: it has nothing to do with aliasing and everything to do with loop unrolling, and with some strange heuristics MSVC applies when deciding which loop to unroll.

1 Answer

    Editorial Team
    Added an answer on May 17, 2026 at 7:12 pm

    I disassembled the code with MSVC to better understand what’s going on. Turns out aliasing wasn’t a problem at all, and neither was some kind of paranoid thread safety.

    Here is the interesting part of the arradd function disassembled:

        for (int n = 0; n < 10; ++n)
        {
            for (int i = 0; i < sz; ++i)
    013C101C  mov         ecx,ebp 
    013C101E  mov         ebx,29B9270h 
            {
                sm += arr[i];
    013C1023  add         eax,dword ptr [ecx-8] 
    013C1026  add         edx,dword ptr [ecx-4] 
    013C1029  add         esi,dword ptr [ecx] 
    013C102B  add         edi,dword ptr [ecx+4] 
    013C102E  add         ecx,10h 
    013C1031  sub         ebx,1 
    013C1034  jne         arradd+23h (13C1023h) 
    013C1036  add         edi,esi 
    013C1038  add         edi,edx 
    013C103A  add         eax,edi 
    013C103C  sub         dword ptr [esp+10h],1 
    013C1041  jne         arradd+16h (13C1016h) 
    013C1043  pop         edi  
    013C1044  pop         esi  
    013C1045  pop         ebp  
    013C1046  pop         ebx  
    

    ecx points to the array, and we can see that the inner loop is unrolled x4 here: note the four consecutive add instructions reading from adjacent addresses, each into its own register, and ecx being advanced by 16 bytes (four 32-bit ints) per iteration. The four partial sums are combined after the loop.
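In source terms, the x4 unrolling in the listing above corresponds roughly to the following hand-written sketch. The names sum_simple and sum_unrolled4 are hypothetical. The tail loop for sizes not divisible by 4 is an addition for generality; the MSVC listing has none because the trip count there is a compile-time constant.

```cpp
#include <cassert>

// Straightforward inner loop, as written in arradd.
int sum_simple(const int* arr, int sz)
{
    assert(sz >= 0);
    int sm = 0;
    for (int i = 0; i < sz; ++i)
        sm += arr[i];
    return sm;
}

// Hand-unrolled x4 with four independent accumulators,
// mirroring the four add instructions into eax, edx, esi and edi.
int sum_unrolled4(const int* arr, int sz)
{
    assert(sz >= 0);
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 3 < sz; i += 4)
    {
        s0 += arr[i];
        s1 += arr[i + 1];
        s2 += arr[i + 2];
        s3 += arr[i + 3];
    }
    // Combine the partial sums, as the adds after the loop do.
    int sm = s0 + s1 + s2 + s3;
    // Tail: at most three leftover elements.
    for (; i < sz; ++i)
        sm += arr[i];
    return sm;
}
```

Separate accumulators matter because each add then depends only on its own register, so the four additions in one iteration do not have to wait on each other.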

    For the unoptimized version of the member function, doadd:

    int tester::doadd()
    {
        sm = 0;
        for (int n = 0; n < 10; ++n)
        {
            for (int i = 0; i < sz; ++i)
            {
                sm += arr[i];
            }
        }
        return sm;
    }
    

    The disassembly is (it’s harder to find since the compiler inlined it into main):

        int tr_result = tr.doadd();
    013C114A  xor         edi,edi 
    013C114C  lea         ecx,[edi+0Ah] 
    013C114F  nop              
    013C1150  xor         eax,eax 
    013C1152  add         edi,dword ptr [esi+eax*4] 
    013C1155  inc         eax  
    013C1156  cmp         eax,0A6E49C0h 
    013C115B  jl          main+102h (13C1152h) 
    013C115D  sub         ecx,1 
    013C1160  jne         main+100h (13C1150h) 
    

    Note 2 things:

    • The sum is kept in a register, edi. Hence, no aliasing “care” is being taken here: the value of sm isn’t re-read from memory on every iteration. edi is initialized just once and then used as the accumulator. You don’t see it being returned, since the compiler used edi directly as the result of the inlined code.
    • The loop is not unrolled. Why? No good reason.

    Finally, here’s an “optimized” version of the member function, with mysm keeping the sum local manually:

    int tester::doadd_opt()
    {
        sm = 0;
        int mysm = 0;
        for (int n = 0; n < 10; ++n)
        {
            for (int i = 0; i < sz; ++i)
            {
                mysm += arr[i];
            }
        }
        sm = mysm;
        return sm;
    }
    

    The (again, inlined) disassembly is:

        int tr_result_opt = tr_opt.doadd_opt();
    013C11F6  xor         edi,edi 
    013C11F8  lea         ebp,[edi+0Ah] 
    013C11FB  jmp         main+1B0h (13C1200h) 
    013C11FD  lea         ecx,[ecx] 
    013C1200  xor         ecx,ecx 
    013C1202  xor         edx,edx 
    013C1204  xor         eax,eax 
    013C1206  add         ecx,dword ptr [esi+eax*4] 
    013C1209  add         edx,dword ptr [esi+eax*4+4] 
    013C120D  add         eax,2 
    013C1210  cmp         eax,0A6E49BFh 
    013C1215  jl          main+1B6h (13C1206h) 
    013C1217  cmp         eax,0A6E49C0h 
    013C121C  jge         main+1D1h (13C1221h) 
    013C121E  add         edi,dword ptr [esi+eax*4] 
    013C1221  add         ecx,edx 
    013C1223  add         edi,ecx 
    013C1225  sub         ebp,1 
    013C1228  jne         main+1B0h (13C1200h) 
    

    The loop here is unrolled, but just x2.
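For comparison, one pass of the x2-unrolled inner loop in the listing above can be sketched like this. The function name sum_unrolled2 is hypothetical, and the comments map variables to registers as I read the disassembly.

```cpp
#include <cassert>

// One pass over the array, unrolled x2 with a single leftover element,
// following the structure of the doadd_opt listing.
int sum_unrolled2(const int* arr, int sz)
{
    assert(sz >= 0);
    int total = 0;      // edi: running total
    int s0 = 0, s1 = 0; // ecx, edx: partial sums for this pass
    int i = 0;          // eax: element index

    // Main loop: two elements per iteration (cmp eax, N-1 / jl).
    for (; i < sz - 1; i += 2)
    {
        s0 += arr[i];
        s1 += arr[i + 1];
    }
    // Leftover element when sz is odd (cmp eax, N / jge).
    if (i < sz)
        total += arr[i];

    // Combine the partial sums into the total (add ecx,edx / add edi,ecx).
    total += s0 + s1;
    return total;
}
```

In the actual listing this pass sits inside the outer loop, with edi accumulating across passes while ecx and edx are zeroed at the start of each one.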

    This explains my speed-difference observations quite well. For an array of 175e6 elements, the function runs in ~1.2 secs, the unoptimized member in ~1.5 secs, and the optimized member in ~1.3 secs. (Note that this may differ for you; on another machine I got closer runtimes for all 3 versions.)

    What about gcc? When compiled with it, all 3 versions ran at ~1.5 secs. Suspecting the lack of unrolling, I looked at gcc’s disassembly and indeed: gcc doesn’t unroll any of the versions.
