Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 6390547
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 25, 20262026-05-25T03:33:38+00:00 2026-05-25T03:33:38+00:00

Years ago I was learning about x86 assembler, CPU pipelining, cache misses, branch prediction,

  • 0

Years ago I was learning about x86 assembler, CPU pipelining, cache misses, branch prediction, and all that jazz.

It was a tale of two halves. I read about all the wonderful advantages of the lengthy pipelines in the processor viz instruction reordering, cache preloading, dependency interleaving, etc.

The downside was that any deviation for the norm was enormously costly. For example, IIRC a certain AMD processor in the early-gigahertz era had a 40 cycle penalty every time you called a function through a pointer (!) and this was apparently normal.

This is not a negligible “don’t worry about it” number! Bear in mind that “good design” normally means “factor your functions as much as possible” and “encode semantics in the data types” which often implies virtual interfaces.

The trade-off is that code which doesn’t perform such operations might get more than two instructions per cycle. These are numbers one wants to worry about when writing high-performance C++ code which is heavy on the object design and light on the number crunching.

I understand that the long-CPU-pipeline trend has been reversing as we enter the low-power era. Here’s my question:

Does the latest generation of x86-compatible processors still suffer massive penalties for virtual function calls, bad branch predictions, etc?

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-25T03:33:38+00:00Added an answer on May 25, 2026 at 3:33 am

    AMD processor in the early-gigahertz era had a 40 cycle penalty every time you called a function

    Huh.. so large..

    There is an "Indirect branch prediction" method, which helps to predict virtual function jump, IF there was the same indirect jump some time ago. There is still a penalty for first and mispredicted virt. function jump.

    Support varies from simple "predicted right if and only if the previous indirect branch was exactly the same" to very complex two-level tens or hundreds entries with detecting of periodic alternation of 2-3 target address for single indirect jmp instruction.

    There was a lot of evolution here…

    http://arstechnica.com/hardware/news/2006/04/core.ars/7

    first introduced with the Pentium M: … indirect branch predictor.

    The indirect branch predictor

    Because indirect branches load their branch targets from a register, instead of having them immediately available as is the case with direct branches, they’re notoriously difficult to predict. Core’s indirect branch predictor is a table that stores history information about the preferred target addresses of each indirect branch that the front end encounters. Thus when the front-end encounters an indirect branch and predicts it as taken, it can ask the indirect branch predictor to direct it to the address in the BTB that the branch will probably want.

    http://www.realworldtech.com/page.cfm?ArticleID=rwt051607033728&p=3

    Indirect branch prediction was first introduced with Intel’s Prescott microarchitecture and later the Pentium M.

    between 16-50% of all branch mispredicts were indirect (29% on average). The real value of indirect branch misprediction is for many of the newer scripting or high level languages, such as Ruby, Perl or Python, which use interpreters. Other common indirect branch common culprits include virtual functions (used in C++) and calls to function pointers.

    http://www.realworldtech.com/page.cfm?ArticleID=RWT102808015436&p=5

    AMD has adopted some of these refinements; for instance adding indirect branch predictor arrays in Barcelona and later processors. However, the K8 has older and less accurate branch predictors than the Core 2.

    http://www.agner.org/optimize/microarchitecture.pdf

    3.12 Indirect jumps on older processors
    Indirect jumps, indirect calls, and returns may go to a different address each time. The
    prediction method for an indirect jump or indirect call is, in processors older than PM and
    K10, simply to predict that it will go to the same target as last time it was executed.

    and the same pdf, page 14

    Indirect jump prediction
    An indirect jump or call is a control transfer instruction that has more than two possible
    targets. A C++ program can generate an indirect jump or call with… a virtual function. An indirect jump or call is generated in assembly by
    specifying a register or a memory variable or an indexed array as the destination of a jump
    or call instruction. Many processors make only one BTB entry for an indirect jump or call.
    This means that it will always be predicted to go to the same target as it did last time.
    As object oriented programming with polymorphous classes has become more common,
    there is a growing need for predicting indirect calls with multiple targets. This can be done
    by assigning a new BTB entry for every new jump target that is encountered. The history
    buffer and pattern history table must have space for more than one bit of information for
    each jump incident in order to distinguish more than two possible targets.
    The PM is the first x86 processor to implement this method. The prediction rule on p. 12 still
    applies with the modification that the theoretical maximum period that can be predicted
    perfectly is mn, where m is the number of different targets per indirect jump, because there
    are mn different possible n-length subsequences. However, this theoretical maximum cannot
    be reached if it exceeds the size of the BTB or the pattern history table.

    Agner’s manual has a longer description of branch predictor in many modern CPUs and the evolution of predictor in cpus of every manufacturer (x86/x86_64).

    Also a lot of theoretical "indirect branch prediction" methods (look in the Google scholar); even wiki said some words about it http://en.wikipedia.org/wiki/Branch_predictor#Prediction_of_indirect_jumps /

    For Atoms from the agner’s micro:

    Prediction of indirect branches
    The Atom has no pattern predictor for indirect branches according to my tests. Indirect
    branches are predicted to go to the same target as last time.

    So, for low power, indirect branch prediction is not so advanced. So does Via Nano:

    Indirect jumps are predicted to go to the same target as last time.

    I think, that shorter pipeline of lowpower x86 has lower penalty, 7-20 ticks.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I started learning c++ about 3 weeks ago after 2 years of java. It
Some years ago I was told about a study into code reuse. Apparently it
I learned Python as my first serious (non BASIC) language about 10 years ago.
This happened to me a few years ago before I knew about SO, but
I have a contact DB that I created 3 years ago in MS ACCESS.
Few years ago I saw functionality that allows building graphics using TFS that show
What happened to Dlinq? I heard a few years ago that it was abandoned.
Years ago when I was working with C# I could easily create a temporary
Years ago when I was starting a small development project, the other developers and
Years ago I released a program called Banshee Screamer Alarm and at the time

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.