Years ago I was learning about x86 assembler, CPU pipelining, cache misses, branch prediction, and all that jazz.
It was a tale of two halves. I read about all the wonderful advantages of the lengthy pipelines in the processor, viz. instruction reordering, cache preloading, dependency interleaving, etc.
The downside was that any deviation from the norm was enormously costly. For example, IIRC a certain AMD processor in the early-gigahertz era had a 40-cycle penalty every time you called a function through a pointer (!) and this was apparently normal.
This is not a negligible “don’t worry about it” number! Bear in mind that “good design” normally means “factor your functions as much as possible” and “encode semantics in the data types” which often implies virtual interfaces.
The trade-off is that code which doesn’t perform such operations might get more than two instructions per cycle. These are numbers one wants to worry about when writing high-performance C++ code which is heavy on the object design and light on the number crunching.
I understand that the long-CPU-pipeline trend has been reversing as we enter the low-power era. Here’s my question:
Does the latest generation of x86-compatible processors still suffer massive penalties for virtual function calls, bad branch predictions, etc?
Huh.. so large..
There is an "indirect branch prediction" mechanism, which helps to predict the target of a virtual function jump, IF the same indirect jump was executed some time ago. There is still a penalty for the first execution and for every mispredicted virtual function jump.
Support varies from the simple "predicted right if and only if the previous indirect branch was exactly the same" to very complex two-level schemes with tens or hundreds of entries, which can detect periodic alternation between 2-3 target addresses for a single indirect jmp instruction.
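To make the two ends of that range concrete, here is a toy simulation in C++. It is only a sketch: real predictors are largely undocumented and much more elaborate, and the names `mispredicts_last_target`, `mispredicts_history` and `periodic_trace` are made up for this illustration. The first function models the "same target as last time" scheme; the second models a two-level scheme that indexes a table by the last few targets of the same jump.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <map>
#include <vector>

// Toy model only; not how any particular CPU implements prediction.
using Target = std::uint32_t;

// Simplest scheme: predict that the jump goes where it went last time.
std::size_t mispredicts_last_target(const std::vector<Target>& trace) {
    std::size_t misses = 0;
    bool have_last = false;
    Target last = 0;
    for (Target actual : trace) {
        if (!have_last || last != actual) ++misses;
        last = actual;
        have_last = true;
    }
    return misses;
}

// Two-level scheme: the last `history_len` targets select a table entry,
// and that entry remembers which target followed this history last time.
std::size_t mispredicts_history(const std::vector<Target>& trace,
                                std::size_t history_len) {
    assert(history_len >= 1);
    std::map<std::deque<Target>, Target> table;
    std::deque<Target> history;
    std::size_t misses = 0;
    for (Target actual : trace) {
        auto it = table.find(history);
        if (it == table.end() || it->second != actual) ++misses;
        table[history] = actual;  // learn the outcome for this history
        history.push_back(actual);
        if (history.size() > history_len) history.pop_front();
    }
    return misses;
}

// Helper: repeat a short pattern of targets `repeats` times.
std::vector<Target> periodic_trace(const std::vector<Target>& pattern,
                                   std::size_t repeats) {
    std::vector<Target> trace;
    trace.reserve(pattern.size() * repeats);
    for (std::size_t r = 0; r < repeats; ++r)
        trace.insert(trace.end(), pattern.begin(), pattern.end());
    return trace;
}
```

For a jump that strictly alternates between two targets, the last-target model misses on every execution, while the history model with `history_len = 1` only misses during the first three executions while it warms up. An unbounded `std::map` stands in for the fixed-size hardware table, so capacity and aliasing effects are not modelled.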
There was a lot of evolution here…
http://arstechnica.com/hardware/news/2006/04/core.ars/7
http://www.realworldtech.com/page.cfm?ArticleID=rwt051607033728&p=3
http://www.realworldtech.com/page.cfm?ArticleID=RWT102808015436&p=5
http://www.agner.org/optimize/microarchitecture.pdf
and the same pdf, page 14
Agner’s manual has a longer description of the branch predictors in many modern CPUs and of how the predictor evolved in each manufacturer's CPUs (x86/x86_64).
There are also a lot of theoretical "indirect branch prediction" methods (look in Google Scholar); even Wikipedia says a few words about it: http://en.wikipedia.org/wiki/Branch_predictor#Prediction_of_indirect_jumps
For Atom, from Agner’s microarchitecture manual:
So, for low-power CPUs, indirect branch prediction is not so advanced. The same goes for the VIA Nano:
I think that the shorter pipeline of low-power x86 CPUs gives a lower penalty, 7-20 ticks.
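If you want to check this on a particular CPU rather than trust the manuals, a rough microbenchmark along these lines may help. It is a sketch, not a rigorous measurement: the class names (`Shape`, `A`, `B`, `C`) and helper functions are invented for the example, and shuffling the pointers also changes the memory access pattern, so cache effects are mixed in with the misprediction cost.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <memory>
#include <random>
#include <vector>

// Hypothetical class hierarchy, used only to force real virtual dispatch.
struct Shape {
    virtual ~Shape() = default;
    virtual long weight() const = 0;
};
struct A : Shape { long weight() const override { return 1; } };
struct B : Shape { long weight() const override { return 2; } };
struct C : Shape { long weight() const override { return 3; } };

// Build n objects grouped by type (long runs with the same call target);
// with shuffled = true the order, and thus the target sequence, is random.
std::vector<std::unique_ptr<Shape>> make_objects(std::size_t n, bool shuffled) {
    std::vector<std::unique_ptr<Shape>> objects;
    objects.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        switch (i * 3 / n) {
            case 0:  objects.push_back(std::make_unique<A>()); break;
            case 1:  objects.push_back(std::make_unique<B>()); break;
            default: objects.push_back(std::make_unique<C>()); break;
        }
    }
    if (shuffled) {
        std::mt19937 rng(12345);  // fixed seed so runs are repeatable
        std::shuffle(objects.begin(), objects.end(), rng);
    }
    return objects;
}

// One virtual call per element; this is the indirect jump being measured.
long sum_weights(const std::vector<std::unique_ptr<Shape>>& objects) {
    long total = 0;
    for (const auto& p : objects) total += p->weight();
    return total;
}

// Wall-clock seconds for `rounds` passes; `sink` keeps the result observable
// so the compiler cannot drop the loop.
double time_rounds(const std::vector<std::unique_ptr<Shape>>& objects,
                   int rounds, long& sink) {
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < rounds; ++r) sink += sum_weights(objects);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling `time_rounds` on `make_objects(n, false)` and on `make_objects(n, true)` with the same `n` and `rounds` gives two timings to compare. Both variants do identical work and produce the same sum, so any gap comes from the ordering, which should give a feel for how much a mispredicted indirect call costs on that machine.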