Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 7732863
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 1, 20262026-06-01T06:50:18+00:00 2026-06-01T06:50:18+00:00

I don’t know how to optimize cache performance at a really low level, thinking

  • 0

I don’t know how to optimize cache performance at a really low level, thinking about cache-line size or associativity. That’s not something you can learn overnight. Considering my program will run on many different systems and architectures, I don’t think it would be worth it anyway. But still, there are probably some steps I can take to reduce cache misses in general.

Here is a description of my problem:

I have a 3d array of integers, representing values at points in space, like [x][y][z]. Each dimension is the same size, so it’s like a cube. From that I need to make another 3d array, where each value in this new array is a function of 7 parameters: the corresponding value in the original 3d array, plus the 6 indices that “touch” it in space. I’m not worried about the edges and corners of the cube for now.

Here is what I mean in C++ code:

void process3DArray (int input[LENGTH][LENGTH][LENGTH], 
                     int output[LENGTH][LENGTH][LENGTH])
{
    for(int i = 1; i < LENGTH-1; i++)
        for (int j = 1; j < LENGTH-1; j++)
            for (int k = 1; k < LENGTH-1; k++)
            //The for loops start at 1 and stop before LENGTH-1
            //or other-wise I'll get out-of-bounds errors
            //I'm not concerned with the edges and corners of the 
            //3d array "cube" at the moment.
            {
                int value = input[i][j][k];

                //I am expecting crazy cache misses here:
                int posX = input[i+1] [j]   [k];
                int negX = input[i-1] [j]   [k];
                int posY = input[i]   [j+1] [k];
                int negY = input[i]   [j-1] [k];
                int posZ = input[i]   [j]   [k+1];
                int negZ = input[i]   [j]   [k-1];

                output [i][j][k] = 
                    process(value, posX, negX, posY, negY, posZ, negZ);
            }
}

However, it seems like if LENGTH is large enough, I’ll get tons of cache misses when I’m fetching the parameters for process. Is there a cache-friendlier way to do this, or a better way to represent my data other than a 3d array?

And if you have the time to answer these extra questions, do I have to consider the value of LENGTH? Like it’s different whether LENGTH is 20 vs 100 vs 10000. Also, would I have to do something else if I used something other than integers, like maybe a 64-byte struct?

@ ildjarn:

Sorry, I did not think that the code that generates the arrays I am passing into process3DArray mattered. But if it does, I would like to know why.

int main() {
    int data[LENGTH][LENGTH][LENGTH];
    for(int i = 0; i < LENGTH; i++)
        for (int j = 0; j < LENGTH; j++)
            for (int k = 0; k < LENGTH; k++)
                data[i][j][k] = rand() * (i + j + k);

    int result[LENGTH][LENGTH][LENGTH];
    process3DArray(data, result);
}
  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-01T06:50:19+00:00Added an answer on June 1, 2026 at 6:50 am

    The most important thing you already have right. If you were using Fortran, you’d be doing it exactly wrong, but that’s another story. What you have right is that you are processing in the inner loop along the direction where memory addresses are closest together. A single memory fetch (beyond the cache) will pull in multiple values, corresponding to a series of adjacent values of k. Inside your loop the cache will contain some number of values from i,j; a similar number from i+/-1, j and from i,j+/-1. So you basically have five disjoint sections of memory active. For small values of LENGTH these will only be 1 or three sections of memory. It is in the nature of how caches are built that you can have more than this many disjoint sections of memory in your active set.

    I hope process() is small, and inline. Otherwise this may well be insignificant. Also, it will affect whether your code fits in the instruction cache.

    Since you’re interested in performance, it is almost always better to initialize five pointers (you only need one for value, posZ and negZ), and then take *(p++) inside the loop.

    input[i+1] [j]   [k];
    

    is asking the compiler to generate 3 adds and two multiplies, unless you have a very good optimizer. If your compiler is particularly lazy about register allocation, you also get four memory accesses; otherwise one.

    *inputIplusOneJK++ 
    

    is asking for one add and a memory reference.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

Don't really know how to formulate the title, but it should be pretty obvious
Don't be afraid to use any technical jargon or low-level explanations for things, please.
Don't know a whole lot about streams. Why does the first version work using
Don't know much about encryption... Say I'm preparing a SAML request to submit to
Don't know much about running a function on every item in an array, still
Don't need to do this right now but thinking about the future... What would
Don't know if there is a better way to do this, so that is
Don't let below code scare you away . The question is really simple, only
Don't know how to google for such, but is there a way to query
Don't know why but font is not displaying.Please help. CSS(in css folder): style.css: @font-face

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.