I have the following two files :- single.cpp :- #include <iostream> #include <stdlib.h> using


Editorial Team

Asked: June 3, 2026 (2026-06-03T10:34:19+00:00)


I have the following two files :-

single.cpp :-

#include <iostream>
#include <stdlib.h>

using namespace std;

unsigned long a=0;

class A {
  public:
    virtual int f() __attribute__ ((noinline)) { return a; } 
};

class B : public A {
  public:
    virtual int f() __attribute__ ((noinline)) { return a; }
    void g() __attribute__ ((noinline)) { return; }
};

int main() {
  cin>>a;
  A* obj;
  if (a>3)
    obj = new B();
  else
    obj = new A();

  unsigned long result=0;

  for (int i=0; i<65535; i++) {
    for (int j=0; j<65535; j++) {
      result+=obj->f();
    }
  }

  cout<<result<<"\n";
}

And

multiple.cpp :-

#include <iostream>
#include <stdlib.h>

using namespace std;

unsigned long a=0;

class A {
  public:
    virtual int f() __attribute__ ((noinline)) { return a; }
};

class dummy {
  public:
    virtual void g() __attribute__ ((noinline)) { return; }
};

class B : public A, public dummy {
  public:
    virtual int f() __attribute__ ((noinline)) { return a; }
    virtual void g() __attribute__ ((noinline)) { return; }
};


int main() {
  cin>>a;
  A* obj;
  if (a>3)
    obj = new B();
  else
    obj = new A();

  unsigned long result=0;

  for (int i=0; i<65535; i++) {
    for (int j=0; j<65535; j++) {
      result+=obj->f();
    }
  }

  cout<<result<<"\n";
}

I am using gcc version 3.4.6 with the -O2 flag.

These are the timing results I get :-

multiple :-

real    0m8.635s
user    0m8.608s
sys     0m0.003s

single :-

real    0m10.072s
user    0m10.045s
sys     0m0.001s

On the other hand, if I reverse the order of the base classes in multiple.cpp, like this :-

class B : public dummy, public A {

then I get the following timings, which are slightly slower than those for single inheritance. That is what one might expect, given the ‘thunk’ adjustments to the this pointer that the code then needs :-

real    0m11.516s
user    0m11.479s
sys     0m0.002s

Any idea why this may be happening? There doesn’t seem to be any difference in the assembly generated for all three cases as far as the loop is concerned. Is there some other place that I need to look at?

Also, I have bound the process to a specific CPU core and I am running it at real-time priority with SCHED_RR.

EDIT:- Mysticial noticed the following, and I have reproduced it.
Adding

cout << "vtable: " << *(void**)obj << endl;

just before the loop in single.cpp makes single as fast as multiple: it clocks in at 8.4 s, just like the public A, public dummy version.


1 Answer

Editorial Team · Answer 1 · 2026-06-03T10:34:21+00:00

I think I have at least a further lead on why this may be happening. The assembly for the loops is identical, but the object files are not!

For the loop preceded by the cout, i.e. :-

cout << "vtable: " << *(void**)obj << endl;

for (int i=0; i<65535; i++) {
  for (int j=0; j<65535; j++) {
    result+=obj->f();
  }
}

I get the following in the object file :-

40092d:       bb fe ff 00 00          mov    $0xfffe,%ebx
400932:       48 8b 45 00             mov    0x0(%rbp),%rax
400936:       48 89 ef                mov    %rbp,%rdi
400939:       ff 10                   callq  *(%rax)
40093b:       48 98                   cltq
40093d:       49 01 c4                add    %rax,%r12
400940:       ff cb                   dec    %ebx
400942:       79 ee                   jns    400932 <main+0x42>
400944:       41 ff c5                inc    %r13d
400947:       41 81 fd fe ff 00 00    cmp    $0xfffe,%r13d
40094e:       7e dd                   jle    40092d <main+0x3d>

Without the cout, however, the result is different. First the source :-

for (int i=0; i<65535; i++) {
  for (int j=0; j<65535; j++) {
    result+=obj->f();
  }
}

and then the object file :-

400a54:       bb fe ff 00 00          mov    $0xfffe,%ebx
400a59:       66                      data16
400a5a:       66                      data16
400a5b:       66                      data16
400a5c:       90                      nop
400a5d:       66                      data16
400a5e:       66                      data16
400a5f:       90                      nop
400a60:       48 8b 45 00             mov    0x0(%rbp),%rax
400a64:       48 89 ef                mov    %rbp,%rdi
400a67:       ff 10                   callq  *(%rax)
400a69:       48 98                   cltq
400a6b:       49 01 c4                add    %rax,%r12
400a6e:       ff cb                   dec    %ebx
400a70:       79 ee                   jns    400a60 <main+0x70>
400a72:       41 ff c5                inc    %r13d
400a75:       41 81 fd fe ff 00 00    cmp    $0xfffe,%r13d
400a7c:       7e d6                   jle    400a54 <main+0x64>

So I’d have to say that it is not really due to false aliasing, as Mysticial suggested, but simply due to these NOPs, which the assembler emits as padding for the compiler's alignment directive.

The assembly in both cases is :-

.L30:
        movl    $65534, %ebx
        .p2align 4,,7
.L29:
        movq    (%rbp), %rax
        movq    %rbp, %rdi
        call    *(%rax)
        cltq
        addq    %rax, %r12
        decl    %ebx
        jns     .L29
        incl    %r13d
        cmpl    $65534, %r13d
        jle     .L30

Now, .p2align 4,,7 pads with NOPs (or other filler) until the address of the next instruction has its low four bits equal to zero, i.e. until it is a multiple of 16. It does so only if that takes at most 7 bytes of padding; otherwise it inserts nothing at all. In the case without the cout, the address of the instruction just after .p2align, before padding, would be

0x400a59 (low 12 bits: 0b101001011001)

The low four bits are 0b1001 = 9, so it takes exactly 7 bytes of padding (<= 7) to reach 0x400a60, and the assembler does in fact insert them in the object file.

On the other hand, for the case with the cout, the instruction just after .p2align ends up at

0x400932 (low 12 bits: 0b100100110010)

Here the low four bits are 0b0010 = 2, so it would take 14 bytes of padding (> 7) to reach the next 16-byte boundary. Hence, no padding is inserted.

So the extra time comes simply from the NOP padding produced by the alignment directives that the compiler emits at -O2 (intended to give better cache alignment), and not really from false aliasing.

I think this resolves the issue. I am using http://sourceware.org/binutils/docs/as/P2align.html
as my reference for what .p2align actually does.
