Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 832745
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 15, 20262026-05-15T04:24:34+00:00 2026-05-15T04:24:34+00:00

I’m now trapped by a performance optimization lab in the book Computer System from

  • 0

I’m now trapped by a performance optimization lab in the book “Computer System from a Programmer’s Perspective” described as following:

In a N*N matrix M, where N is multiple of 32, the rotate operation can be represented as:
Transpose: interchange elements M(i,j) and M(j,i)
Exchange rows: Row i is exchanged with row N-1-i

A example for matrix rotation(N is 3 instead of 32 for simplicity):

-------                          -------
|1|2|3|                          |3|6|9|
-------                          -------
|4|5|6|    after rotate is       |2|5|8|
-------                          -------
|7|8|9|                          |1|4|7|
-------                          -------

A naive implementation is:

#define RIDX(i,j,n) ((i)*(n)+(j))

    void naive_rotate(int dim, pixel *src, pixel *dst) 
    {
        int i, j;

        for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
         dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
    }

I come up with an idea by inner-loop-unroll. The result is:

Code Version          Speed Up
  original                1x
 unrolled by 2           1.33x
 unrolled by 4           1.33x
 unrolled by 8           1.55x
 unrolled by 16          1.67x
 unrolled by 32          1.61x

I also get a code snippet from pastebin.com that seems can solve this problem:

void rotate(int dim, pixel *src, pixel *dst) 
{
    int stride = 32;
    int count = dim >> 5;
    src += dim - 1; 
    int a1 = count;
    do {
        int a2 = dim;
        do {
            int a3 = stride;
            do {
                *dst++ = *src;
                src += dim;
            } while(--a3);
            src -= dim * stride + 1;
            dst += dim - stride;
        } while(--a2);
        src += dim * (stride + 1);
        dst -= dim * dim - stride;
    } while(--a1);
}

After carefully read the code, I think main idea of this solution is treat 32 rows as a data zone, and perform the rotating operation respectively. Speed up of this version is 1.85x, overwhelming all the loop-unroll version.

Here are the questions:

  1. In the inner-loop-unroll version, why does increment slow down if the unrolling factor increase, especially change the unrolling factor from 8 to 16, which does not effect the same when switch from 4 to 8? Does the result have some relationship with depth of the CPU pipeline? If the answer is yes, could the degrade of increment reflect pipeline length?

  2. What is the probable reason for the optimization of data-zone version? It seems that there is no too much essential difference from the original naive version.

EDIT:

My test environment is Intel Centrino Duo architecture and the verion of gcc is 4.4

Any advice will be highly appreciated!

Kind regards!

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-15T04:24:35+00:00Added an answer on May 15, 2026 at 4:24 am

    What kind of processor are you testing this on? I dimly remember that unrolling loops helps when the processor can handle multiple operations at once, but only up to the maximum number of parallel executions. So if your processor can only handle 8 simultaneous instructions, then unrolling to 16 won’t help. But someone with knowledge of more recent processor design will have to pipe up/correct me.

    EDIT: According to this PDF, the centrino core2 duo has two processors, each of which is capable of 4 simultaneous instructions. It’s generally not so simple, though. Unless your compiler is optimizing across both cores (ie, when you run the task manager (if you’re on windows, top if you’re on linux), you’ll see that CPU usage is maxed out), your process will be running on one core at a time. The processor also features 14 stages of execution, so if you can keep the pipeline full, you’ll get a faster execution.

    Continuing along the theoretical route, then, you get a speed improvement of 33% with a single unroll because you’re starting to take advantage of simultaneous instruction execution. Going to 4 unrolls doesn’t really help, because you’re now still within that 4-simultaneous-instruction limit. Going to 8 unrolls helps because the processor can now fill the pipeline more completely, so more instructions will get executed per clock cycle.

    For this last, think about how a McDonald’s drive through works (I think that that’s relatively widespread?). A car enters the drivethrough, orders at one window, pays at a second window, and receives food at a third window. If a second drive enters when the first is still ordering, then by the time both finish (assuming each operation in the drive through takes one ‘cycle’ or time unit), then 2 full operations will be done by the time 4 cycles have elapsed. If each car did all of their operations at one window, then the first car would take 3 cycles for ordering, paying, and getting food, and then the second car would also take 3 cycles for ordering, paying and getting food, for a total of 6 cycles. So, operation time due to pipelining decreases.

    Of course, you have to keep the pipeline full to get the largest speed improvement. 14 stages is a lot of stages, so going to 16 unrolls will give you some improvement still because more operations can be in the pipeline.

    Going to 32 causing a decrease in performance may have to do with bandwidth to the processor from the cache (again a guess, can’t know for sure without seeing your code exactly, as well as the machine code). If all the instructions can’t fit into cache or into the registers, then there is some time necessary to prepare them all to run (ie, people have to get into their cars and get to the drive through in the first place). There will be some reduction in speed if they all get there all at once, and some shuffling of the line has to be done to make the operation proceed.

    Note that each movement from src to dst is not free or a single operation. You have the lookups into the arrays, and that costs time.

    As for why the second version works so quickly, I’m going to hazard a guess that it has to do with the [] operator. Every time that gets called, you’re doing some lookups into both the src and dst arrays, resolving pointers to locations, and then retrieving the memory. The other code is going straight to the pointers of the arrays and accessing them directly; basically, for each of the movements from src to dst, there are less operations involved in the move, because the lookups have been handled explicitly through pointer placement. If you use [], these steps are followed:

    • do any math inside the []
    • take a pointer to that location (startOfArray + [] in memory)
    • return the result of that location in memory

    If you walk along with a pointer, you just do the math to do the walk (typically just an addition, no multiplication) and then return the result, because you’ve already done the second step.

    If I’m right, then you might get better results with the second code by unrolling its inner loop as well, so that multiple operations can be pipelined simultaneously.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Ask A Question

Stats

  • Questions 435k
  • Answers 435k
  • Best Answers 0
  • User 1
  • Popular
  • Answers
  • Editorial Team

    How to approach applying for a job at a company ...

    • 7 Answers
  • Editorial Team

    What is a programmer’s life like?

    • 5 Answers
  • Editorial Team

    How to handle personal stress caused by utterly incompetent and ...

    • 5 Answers
  • Editorial Team
    Editorial Team added an answer No, you can apply authorization to controller's actions only. You… May 15, 2026 at 3:38 pm
  • Editorial Team
    Editorial Team added an answer What I have used in the past is to authenticate… May 15, 2026 at 3:38 pm
  • Editorial Team
    Editorial Team added an answer Your markup is really ugly for starters, why have extra… May 15, 2026 at 3:38 pm

Trending Tags

analytics british company computer developers django employee employer english facebook french google interview javascript language life php programmer programs salary

Top Members

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.