Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 198881
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 11, 20262026-05-11T16:58:32+00:00 2026-05-11T16:58:32+00:00

Is there a faster way to rotate a large bitmap by 90 or 270

  • 0

Is there a faster way to rotate a large bitmap by 90 or 270 degrees than simply doing a nested loop with inverted coordinates?

The bitmaps are 8bpp and typically 2048x2400x8bpp

Currently I do this by simply copying with argument inversion, roughly (pseudo code:

for x = 0 to 2048-1
  for y = 0 to 2048-1
    dest[x][y]=src[y][x];

(In reality I do it with pointers, for a bit more speed, but that is roughly the same magnitude)

GDI is quite slow with large images, and GPU load/store times for textures (GF7 cards) are in the same magnitude as the current CPU time.

Any tips, pointers? An in-place algorithm would even be better, but speed is more important than being in-place.

Target is Delphi, but it is more an algorithmic question. SSE(2) vectorization no problem, it is a big enough problem for me to code it in assembler


Follow up to Nils’ answer

  • Image 2048×2700 -> 2700×2048
  • Compiler Turbo Explorer 2006 with optimization on.
  • Windows: Power scheme set to "Always on". (important!!!!)
  • Machine: Core2 6600 (2.4 GHz)

time with old routine: 32ms (step 1)

time with stepsize 8 : 12ms

time with stepsize 16 : 10ms

time with stepsize 32+ : 9ms

Meanwhile I also tested on a Athlon 64 X2 (5200+ iirc), and the speed up there was slightly more than a factor four (80 to 19 ms).

The speed up is well worth it, thanks. Maybe that during the summer months I’ll torture myself with a SSE(2) version. However I already thought about how to tackle that, and I think I’ll run out of SSE2 registers for an straight implementation:

for n:=0 to 7 do
  begin
    load r0, <source+n*rowsize> 
    shift byte from r0 into r1
    shift byte from r0 into r2
    ..
    shift byte from r0 into r8
  end; 
store r1, <target>   
store r2, <target+1*<rowsize>
..
store r8, <target+7*<rowsize>   

So 8×8 needs 9 registers, but 32-bits SSE only has 8. Anyway that is something for the summer months 🙂

Note that the pointer thing is something that I do out of instinct, but it could be there is actually something to it, if your dimensions are not hardcoded, the compiler can’t turn the mul into a shift. While muls an sich are cheap nowadays, they also generate more register pressure afaik.

The code (validated by subtracting result from the "naieve" rotate1 implementation):

const stepsize = 32;
procedure rotatealign(Source: tbw8image; Target:tbw8image);

var stepsx,stepsy,restx,resty : Integer;
   RowPitchSource, RowPitchTarget : Integer;
   pSource, pTarget,ps1,ps2 : pchar;
   x,y,i,j: integer;
   rpstep : integer;
begin
  RowPitchSource := source.RowPitch;          // bytes to jump to next line. Can be negative (includes alignment)
  RowPitchTarget := target.RowPitch;        rpstep:=RowPitchTarget*stepsize;
  stepsx:=source.ImageWidth div stepsize;
  stepsy:=source.ImageHeight div stepsize;
  // check if mod 16=0 here for both dimensions, if so -> SSE2.
  for y := 0 to stepsy - 1 do
    begin
      psource:=source.GetImagePointer(0,y*stepsize);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(target.imagewidth-(y+1)*stepsize,0);
      for x := 0 to stepsx - 1 do
        begin
          for i := 0 to stepsize - 1 do
            begin
              ps1:=@psource[rowpitchsource*i];   // ( 0,i)
              ps2:=@ptarget[stepsize-1-i];       //  (maxx-i,0);
              for j := 0 to stepsize - 1 do
               begin
                 ps2[0]:=ps1[j];
                 inc(ps2,RowPitchTarget);
               end;
            end;
          inc(psource,stepsize);
          inc(ptarget,rpstep);
        end;
    end;
  // 3 more areas to do, with dimensions
  // - stepsy*stepsize * restx        // right most column of restx width
  // - stepsx*stepsize * resty        // bottom row with resty height
  // - restx*resty                    // bottom-right rectangle.
  restx:=source.ImageWidth mod stepsize;   // typically zero because width is 
                                          // typically 1024 or 2048
  resty:=source.Imageheight mod stepsize;
  if restx>0 then
    begin
      // one loop less, since we know this fits in one line of  "blocks"
      psource:=source.GetImagePointer(source.ImageWidth-restx,0);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(Target.imagewidth-stepsize,Target.imageheight-restx);
      for y := 0 to stepsy - 1 do
        begin
          for i := 0 to stepsize - 1 do
            begin
              ps1:=@psource[rowpitchsource*i];   // ( 0,i)
              ps2:=@ptarget[stepsize-1-i];       //  (maxx-i,0);
              for j := 0 to restx - 1 do
               begin
                 ps2[0]:=ps1[j];
                 inc(ps2,RowPitchTarget);
               end;
            end;
         inc(psource,stepsize*RowPitchSource);
         dec(ptarget,stepsize);
       end;
    end;
  if resty>0 then
    begin
      // one loop less, since we know this fits in one line of  "blocks"
      psource:=source.GetImagePointer(0,source.ImageHeight-resty);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(0,0);
      for x := 0 to stepsx - 1 do
        begin
          for i := 0 to resty- 1 do
            begin
              ps1:=@psource[rowpitchsource*i];   // ( 0,i)
              ps2:=@ptarget[resty-1-i];       //  (maxx-i,0);
              for j := 0 to stepsize - 1 do
               begin
                 ps2[0]:=ps1[j];
                 inc(ps2,RowPitchTarget);
               end;
            end;
         inc(psource,stepsize);
         inc(ptarget,rpstep);
       end;
    end;
 if (resty>0) and (restx>0) then
    begin
      // another loop less, since only one block
      psource:=source.GetImagePointer(source.ImageWidth-restx,source.ImageHeight-resty);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(0,target.ImageHeight-restx);
      for i := 0 to resty- 1 do
        begin
          ps1:=@psource[rowpitchsource*i];   // ( 0,i)
          ps2:=@ptarget[resty-1-i];       //  (maxx-i,0);
          for j := 0 to restx - 1 do
            begin
              ps2[0]:=ps1[j];
              inc(ps2,RowPitchTarget);
            end;
       end;
    end;
end;

Update 2 Generics

I tried to update this code to a generics version in Delphi XE. I failed because of QC 99703, and forum people have already confirmed it also exists in XE2. Please vote for it 🙂

Update 3 Generics
Works now in XE10

Update 4

In 2017 i did some work on a assembler version for 8×8 cubes of 8bpp images only and related SO question about shuffle bottlenecks where Peter Cordes generously helped me out. This code still has a missed oportunity and still needs another looptiling level again to aggregate multiple 8×8 block iterations into pseudo larger ones like 64×64. Now it is whole lines again and that is wasteful.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-11T16:58:33+00:00Added an answer on May 11, 2026 at 4:58 pm

    Yes, there are faster ways to do this.

    Your simple loop spends most of the time in cache misses. This happends because you touch a lot of data at very different places in a tight loop. Even worse: Your memory locations are exactly a power of two apart. That’s a size where the cache performs worst.

    You can improve this rotation algorithm if you improve the locality of your memory accesses.

    A simple way to do this would be to rotate each 8×8 pixel block on it’s own using the same code you’ve used for your whole bitmap, and wrap another loop that splits the image rotation into chunks of 8×8 pixels each.

    E.g. something like this (not checked, and sorry for the C-code. My Delphi skills aren’t up to date):

     // this is the outer-loop that breaks your image rotation
     // into chunks of 8x8 pixels each:
     for (int block_x = 0; block_x < 2048; block_x+=8)
     {
        for (int block_y = 0; blocky_y < 2048; block_y+=8)
        { 
           // this is the inner-loop that processes a block
           // of 8x8 pixels.
           for (int x= 0; x<8; x++)
             for (int y=0; y<8; y++)
                dest[x+block_x][y+block_y] = src[y+block_y][x+block_x]
        }
     } 
    

    There are other ways as well. You could process the data in Hilbert-Order or Morton-Order. That would be in theory even a bit faster, but the code will be much more complex.

    Btw – Since you’ve mentioned that SSE is an option for you. Note that you can rotate a 8×8 byte block within the SSE-registers. It’s a bit tricky to get it working, but looking at SSE matrix transpose code should get you started as it’s the same thing.


    EDIT:

    Just checked:

    With a block-size of 8×8 pixels the code runs ca. 5 times faster on my machine. With a block-size of 16×16 it runs 10 times faster.

    Seems like it’s a good idea to experiment with different block-sizes.

    Here is the (very simple) test-program I’ve used:

    #include <stdio.h>
    #include <windows.h>
    
    char temp1[2048*2048];
    char temp2[2048*2048];
    
    void rotate1 (void)
    {
      int x,y;
      for (y=0; y<2048; y++)
      for (x=0; x<2048; x++)
        temp2[2048*y+x] = temp1[2048*x+y];
    }
    
    void rotate2 (void)
    {
      int x,y;
      int bx, by;
    
      for (by=0; by<2048; by+=8)
      for (bx=0; bx<2048; bx+=8)
      for (y=0; y<8; y++)
      for (x=0; x<8; x++)
        temp2[2048*(y+by)+x+bx] = temp1[2048*(x+bx)+y+by];
    }
    
    void rotate3 (void)
    {
      int x,y;
      int bx, by;
    
      for (by=0; by<2048; by+=16)
      for (bx=0; bx<2048; bx+=16)
      for (y=0; y<16; y++)
      for (x=0; x<16; x++)
        temp2[2048*(y+by)+x+bx] = temp1[2048*(x+bx)+y+by];
    }
    
    
    int main (int argc, char **args)
    {
      int i, t1;
    
      t1 = GetTickCount();
      for (i=0; i<20; i++) rotate1();
      printf ("%d\n", GetTickCount()-t1);
    
      t1 = GetTickCount();
      for (i=0; i<20; i++) rotate2();
      printf ("%d\n", GetTickCount()-t1);
    
      t1 = GetTickCount();
      for (i=0; i<20; i++) rotate3();
      printf ("%d\n", GetTickCount()-t1);
    
    }
    
    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

Is there a faster way than calling moveToNext in a loop?
Is there a better, more elegant (and/or possibly faster) way than boolean isNumber =
Is there a faster way then to simply catch an exception like below? try
is there any faster way to parse a text than by walk each byte
Is there any faster way to achieve padding of pixels to a BufferedImage than
Is there a faster way to remove a directory then simply submitting rm -r
Is there a faster way to set the vertex buffer data than SetVertexBuffer ?
Is there any faster way to access the frame buffer than using glReadPixels? I
okay not sure if there is a better way than I'm already doing. Right
Is there a faster way than BitConverter.ToInt32 to convert a byte array to an

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.