Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8197133
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 7, 20262026-06-07T05:40:43+00:00 2026-06-07T05:40:43+00:00

After conducting some experiments on square matrices of different sizes, a pattern came up.

  • 0

After conducting some experiments on square matrices of different sizes, a pattern came up. Invariably, transposing a matrix of size 2^n is slower than transposing one of size 2^n+1. For small values of n, the difference is not major.

Big differences occur however over a value of 512. (at least for me)

Disclaimer: I know the function doesn’t actually transpose the matrix because of the double swap of elements, but it makes no difference.

Follows the code:

#define SAMPLES 1000
#define MATSIZE 512

#include <time.h>
#include <iostream>
int mat[MATSIZE][MATSIZE];

void transpose()
{
   for ( int i = 0 ; i < MATSIZE ; i++ )
   for ( int j = 0 ; j < MATSIZE ; j++ )
   {
       int aux = mat[i][j];
       mat[i][j] = mat[j][i];
       mat[j][i] = aux;
   }
}

int main()
{
   //initialize matrix
   for ( int i = 0 ; i < MATSIZE ; i++ )
   for ( int j = 0 ; j < MATSIZE ; j++ )
       mat[i][j] = i+j;

   int t = clock();
   for ( int i = 0 ; i < SAMPLES ; i++ )
       transpose();
   int elapsed = clock() - t;

   std::cout << "Average for a matrix of " << MATSIZE << ": " << elapsed / SAMPLES;
}

Changing MATSIZE lets us alter the size (duh!). I posted two versions on ideone:

  • size 512 – average 2.46 ms – http://ideone.com/1PV7m
  • size 513 – average 0.75 ms – http://ideone.com/NShpo

In my environment (MSVS 2010, full optimizations), the difference is similar :

  • size 512 – average 2.19 ms
  • size 513 – average 0.57 ms

Why is this happening?

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-07T05:40:44+00:00Added an answer on June 7, 2026 at 5:40 am

    The explanation comes from Agner Fog in Optimizing software in C++ and it reduces to how data is accessed and stored in the cache.

    For terms and detailed info, see the wiki entry on caching, I’m gonna narrow it down here.

    A cache is organized in sets and lines. At a time, only one set is used, out of which any of the lines it contains can be used. The memory a line can mirror times the number of lines gives us the cache size.

    For a particular memory address, we can calculate which set should mirror it with the formula:

    set = ( address / lineSize ) % numberOfsets
    

    This sort of formula ideally gives a uniform distribution across the sets, because each memory address is as likely to be read (I said ideally).

    It’s clear that overlaps can occur. In case of a cache miss, the memory is read in the cache and the old value is replaced. Remember each set has a number of lines, out of which the least recently used one is overwritten with the newly read memory.

    I’ll try to somewhat follow the example from Agner:

    Assume each set has 4 lines, each holding 64 bytes. We first attempt to read the address 0x2710, which goes in set 28. And then we also attempt to read addresses 0x2F00, 0x3700, 0x3F00 and 0x4700. All of these belong to the same set. Before reading 0x4700, all lines in the set would have been occupied. Reading that memory evicts an existing line in the set, the line that initially was holding 0x2710. The problem lies in the fact that we read addresses that are (for this example) 0x800 apart. This is the critical stride (again, for this example).

    The critical stride can also be calculated:

    criticalStride = numberOfSets * lineSize
    

    Variables spaced criticalStride or a multiple apart contend for the same cache lines.

    This is the theory part. Next, the explanation (also Agner, I’m following it closely to avoid making mistakes):

    Assume a matrix of 64×64 (remember, the effects vary according to the cache) with an 8kb cache, 4 lines per set * line size of 64 bytes. Each line can hold 8 of the elements in the matrix (64-bit int).

    The critical stride would be 2048 bytes, which correspond to 4 rows of the matrix (which is continuous in memory).

    Assume we’re processing row 28. We’re attempting to take the elements of this row and swap them with the elements from column 28. The first 8 elements of the row make up a cache line, but they’ll go into 8 different cache lines in column 28. Remember, critical stride is 4 rows apart (4 consecutive elements in a column).

    When element 16 is reached in the column (4 cache lines per set & 4 rows apart = trouble) the ex-0 element will be evicted from the cache. When we reach the end of the column, all previous cache lines would have been lost and needed reloading on access to the next element (the whole line is overwritten).

    Having a size that is not a multiple of the critical stride messes up this perfect scenario for disaster, as we’re no longer dealing with elements that are critical stride apart on the vertical, so the number of cache reloads is severely reduced.

    Another disclaimer – I just got my head around the explanation and hope I nailed it, but I might be mistaken. Anyway, I’m waiting for a response (or confirmation) from Mysticial. 🙂

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

After some time researching and trying different things I still cannot get my @ExceptionHandler
After a long search I'm still confused about it although I found some related
I read this blog: After conducting a bit of research I discovered that this
I am getting some strange character displays on hyper terminal, after establishing communication with
Summary: I'm having some svn trouble, and after debugging it for a while, I'm
After connecting to server, I run some commands on server and then trying to
I am working on this code but even after consulting other threads I am
I have a class that gathers templates and displays the final output after connecting
I've been developing with kinect XBOX 360 for a while. Today, after connecting a
After deploying WCF server (svc) on my Server, I have got this message when

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.