The Archive Base


Editorial Team
Asked: May 16, 2026


From http://www.boost.org/community/implementation_variations.html

“… coding differences such as changing a class from virtual to non-virtual members or removing a level of indirection are unlikely to make any measurable difference unless deep in an inner loop. And even in an inner loop, modern CPUs often execute such competing code sequences in the same number of clock cycles!”

I am trying to understand the “even in an inner loop” part. Specifically, what mechanisms do CPUs implement to execute the two code sequences (virtual vs. non-virtual, or with an additional level of indirection) within the same number of clock cycles? I know about instruction pipelining and caching, but how is it possible to perform a virtual call within the same number of clock cycles as a non-virtual call? How is the indirection “lost”?


1 Answer

  1. Editorial Team
    Added an answer on May 16, 2026 at 6:26 am

    Caching (e.g. branch target caching), parallel load units (part of pipelining, but also things like “hit under miss” which don’t stall the pipeline), and out-of-order execution are likely to help transform a load–load–branch into something that is closer to a fixed branch. Instruction folding/elimination (what’s the proper term for this?) in the decode or branch prediction stage of the pipeline may also contribute.

    All of this relies on a lot of different things, though: how many different branch targets there are (e.g. how many different virtual overrides you are likely to trigger), how many things you loop over (is the branch target cache “warm”? how about the icache/dcache?), how the virtual tables or indirection tables are laid out in memory (are they cache-friendly, or is each new vtable load possibly evicting an old vtable?), whether the cache is being invalidated repeatedly due to multicore ping-ponging, etc.

    (Disclaimer: I’m definitely not an expert here, and a lot of my knowledge comes from studying in-order embedded processors, so some of this is extrapolation. If you have corrections, feel free to comment!)

    The correct way to determine if it’s going to be a problem for a specific program is of course to profile. If you can, do so with the help of hardware counters — they can tell you a lot about what’s going on in the various stages of the pipeline.


    Edit:

    As Hans Passant points out in a comment above, the key to getting these two things to take the same amount of time is the ability to effectively “retire” more than one instruction per cycle. Instruction elimination can help with this, but superscalar design is probably more important (hit under miss is a very small and specific example; fully redundant load units might be a better one).

    Let’s take an ideal situation, and assume a direct branch is just one instruction:

    branch dest
    

    …and an indirect branch is three (maybe you can get it in two, but it’s greater than one):

    load vtable from this
    load dest from vtable
    branch dest
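To connect those two instruction sequences back to C++, here is a rough sketch that spells out by hand what a compiler typically generates for a virtual call. All names (`Shape`, `ShapeVTable`, `rect_area`, and so on) are made up for illustration, and the layout is an assumption that mimics common implementations rather than any specific ABI.

```cpp
// Hand-rolled emulation of a virtual call. Hypothetical names throughout;
// this only approximates what a typical compiler emits.
struct Shape;

struct ShapeVTable {
    int (*area)(const Shape*);   // one slot per virtual function
};

struct Shape {
    const ShapeVTable* vtable;   // the hidden pointer a compiler would add
    int w;
    int h;
};

int rect_area(const Shape* s) { return s->w * s->h; }
int tri_area(const Shape* s) { return s->w * s->h / 2; }

const ShapeVTable rect_vtable{&rect_area};
const ShapeVTable tri_vtable{&tri_area};

// Direct call: the target is fixed at compile time ("branch dest").
int direct_area(const Shape* s) { return rect_area(s); }

// Indirect call, written out as the three steps listed above.
int indirect_area(const Shape* s) {
    const ShapeVTable* vt = s->vtable;      // load vtable from this
    int (*dest)(const Shape*) = vt->area;   // load dest from vtable
    return dest(s);                         // branch dest
}
```

For a `Shape r{&rect_vtable, 3, 4}`, both `direct_area(&r)` and `indirect_area(&r)` return the same value; the only difference is the two dependent loads in front of the branch.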
    

    Let’s assume an absolutely perfect situation: *this and the entire vtable are in L1 cache, L1 cache is fast enough to support amortized one cycle per instruction cost for the two loads. (You can even assume the processor reordered the loads and intermixed them with earlier instructions to allow time for them to complete before the branch; it doesn’t matter for this example.) Also assume the branch target cache is hot, and there’s no pipeline flush cost for the branch, and the branch instruction comes down to a single cycle (amortized).

    The theoretical minimum time for the first example is therefore 1 cycle (amortized).

    The theoretical minimum for the second example, absent instruction elimination or redundant functional units or something that will allow retiring more than one instruction per cycle, is 3 cycles (there are 3 instructions)!

    The indirect branch sequence will always be slower, because there are more instructions, until you reach into something like superscalar design that allows retiring more than one instruction per cycle.

    Once you have this, the minimum for both examples becomes something between 0 and 1 cycles, again, provided everything else is ideal. Arguably you have to have more ideal circumstances for the second example to actually reach that theoretical minimum than for the first example, but it’s now possible.

    In some of the cases you’d care about, you’re probably not going to reach that minimum for either example. Either the branch target cache will be cold, or the vtable won’t be in the data cache, or the machine won’t be capable of reordering the instructions to take full advantage of the redundant functional units.

    …this is where profiling comes in, which is generally a good idea anyway.
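As a starting point before reaching for a real profiler, a crude timing harness along these lines can at least show whether the difference is visible at all. This is a minimal sketch: the `Op`, `AddOne`, `AddTwo`, `make_ops`, and `time_it` names are hypothetical, and the `mixed` flag is just one simple way to vary the number of branch targets. An optimizing compiler may devirtualize or fold these loops, so treat the numbers with suspicion and check the disassembly.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

struct Op {
    virtual ~Op() = default;
    virtual std::uint64_t apply(std::uint64_t x) const = 0;
};
struct AddOne final : Op {
    std::uint64_t apply(std::uint64_t x) const override { return x + 1; }
};
struct AddTwo final : Op {
    std::uint64_t apply(std::uint64_t x) const override { return x + 2; }
};

// mixed == false: every element is AddOne (a single branch target).
// mixed == true: AddOne and AddTwo alternate (two branch targets).
std::vector<std::unique_ptr<Op>> make_ops(std::size_t n, bool mixed) {
    std::vector<std::unique_ptr<Op>> ops;
    ops.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        if (mixed && i % 2 == 1) {
            ops.push_back(std::make_unique<AddTwo>());
        } else {
            ops.push_back(std::make_unique<AddOne>());
        }
    }
    return ops;
}

// One virtual call per element.
std::uint64_t run_virtual(const std::vector<std::unique_ptr<Op>>& ops) {
    std::uint64_t acc = 0;
    for (const auto& op : ops) acc = op->apply(acc);
    return acc;
}

// Same arithmetic; the qualified call bypasses virtual dispatch.
std::uint64_t run_direct(std::size_t n) {
    AddOne op;
    std::uint64_t acc = 0;
    for (std::size_t i = 0; i < n; ++i) acc = op.AddOne::apply(acc);
    return acc;
}

struct Timed {
    std::uint64_t value;   // result of the work, so it is not optimized away
    double nanoseconds;    // wall-clock time taken
};

// Times any callable that returns a std::uint64_t.
template <typename F>
Timed time_it(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    const std::uint64_t value = work();
    const auto stop = std::chrono::steady_clock::now();
    return {value, std::chrono::duration<double, std::nano>(stop - start).count()};
}
```

You would call it as `time_it([&] { return run_virtual(ops); })` for both a uniform and a mixed `ops` vector, compare against `time_it([&] { return run_direct(n); })`, and repeat the runs enough times to see past noise. Hardware counters will still tell you far more than wall-clock time does.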

    You can also just adopt a slight paranoia about virtuals in the first place. See Noel Llopis’s article on data-oriented design, the excellent Pitfalls of Object-Oriented Programming slides, and Mike Acton’s grumpy-yet-educational presentations. Now you’ve suddenly moved into patterns that the CPU is already likely to be happy with, if you’re processing a lot of data.
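One pattern in that spirit, as I understand it, is to store each concrete type in its own contiguous array and loop over each array separately, so there is no per-element dispatch at all. The sketch below is only an illustration of that idea; `Circle`, `Square`, `Shapes`, and `total_area` are hypothetical names and are not taken from the linked material.

```cpp
#include <vector>

struct Circle { double r; };
struct Square { double side; };

// One contiguous array per concrete type instead of one array of
// base-class pointers.
struct Shapes {
    std::vector<Circle> circles;
    std::vector<Square> squares;
};

// Each loop has a single, statically known body: no vtable load and no
// indirect branch per element.
double total_area(const Shapes& s) {
    double total = 0.0;
    for (const Circle& c : s.circles) total += 3.141592653589793 * c.r * c.r;
    for (const Square& q : s.squares) total += q.side * q.side;
    return total;
}
```

The tradeoff is that adding a new shape type means touching `Shapes` and `total_area`, which is exactly the expressiveness that `virtual` buys you.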

    High level language features like virtual are usually a tradeoff between expressiveness and control. I honestly think, though, by just increasing your awareness of what virtual is actually doing (don’t be afraid to read the disassembly view from time to time, and definitely peek at your CPU’s architecture manuals), you’ll tend to use it when it makes sense and not when it doesn’t, and a profiler can cover the rest if needed.

    One-size-fits-all statements about “don’t use virtual” or “virtual use is unlikely to make a measurable difference” make me grouchy. The reality is usually more complicated, and either you’re going to be in a situation where you care enough to profile or avoid it, or you’re in that other 95% where it’s probably not worth caring except for the possible educational content.
