I have two ways to program the same functionality. Method 1: doTheWork(int action) {

Question

0

Editorial Team

Asked: May 16, 20262026-05-16T04:49:36+00:00 2026-05-16T04:49:36+00:00

I have two ways to program the same functionality. Method 1: doTheWork(int action) {

0

I have two ways to program the same functionality.

Method 1:

doTheWork(int action)
{
    for(int i = 0 i < 1000000000; ++i)
    {
        doAction(action);
    }
}

Method 2:

doTheWork(int action)
{
    switch(action)
    {
    case 1:
        for(int i = 0 i < 1000000000; ++i)
        {
            doAction<1>();
        }
        break;
    case 2:
        for(int i = 0 i < 1000000000; ++i)
        {
            doAction<2>();
        }
        break;
    //-----------------------------------------------
    //... (there are 1000000 cases here)
    //-----------------------------------------------
    case 1000000:
        for(int i = 0 i < 1000000000; ++i)
        {
            doAction<1000000>();
        }
        break;
    }
}

Let’s assume that the function doAction(int action) and the function template<int Action> doAction() consist of about 10 lines of code that will get inlined at compile-time. Calling doAction(#) is equiavalent to doAction<#>() in functionality, but the non-templated doAction(int value) is somewhat slower than template<int Value> doAction(), since some nice optimizations can be done in the code when the argument value is known at compile time.

So my question is, do all the millions of lines of code fill the CPU L1 cache (and more) in the case of the templated function (and thus degrade performance considerably), or does only the lines of doAction<#>() inside the loop currently being run get cached?

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-16T04:49:37+00:00

It depends on the actual code size – 10 lines of code can be little or much – and of course on the actual machine.

However, Method 2 violently violates this decades rule of thumb: instructions are cheap, memory access is not.

Scalability limit

Your optimizations are usually linear – you might shave off 10, 20 maybe even 30% of execution time. Hitting a cache limit is highly nonlinear – as in “running into a brick wall” nonlinear.

As soon as your code size significantly exceeds the 2nd/3rd level cache’s size, Method 2 will lose big time, as the following estimation of a high end consumer system shows:

DDR3-1333 with 10667MB/s peak memory bandwidth,
Intel Core i7 Extreme with ~75000 MIPS

gives you 10667MB / 75000M = 0.14 bytes per instruction for break even – anything larger, and main memory can’t keep up with the CPU.

Typical x86 instruction sizes are 2..3 bytes executing in 1..2 cycles (now, granted, this isn’t necessarily the same instructions, as x86 instructions are split up. Still…)
Typical x64 instruction lengths are even larger.

How much does your cache help?
I found the following number (different source, so it’s hard to compare):
i7 Nehalem L2 cache (256K, >200GB/s bandwidth) which could almost keep up with x86 instructions, but probably not with x64.

In addition, your L2 cache will kick in completely only if

you have perfect prediciton of the next instructions or you don’t have first-run penalty and it fits the cache completely
there’s no significant amount of data being processed
there’s no significant other code in your “inner loop”
there’s no thread executing on this core

Given that, you can lose much earlier, especially on a CPU/board with smaller caches.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I have two ways to program the same functionality. Method 1: doTheWork(int action) {

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply