Is there a faster way to rotate a large bitmap by 90 or 270

Question

0

Asked: May 11, 20262026-05-11T16:58:32+00:00 2026-05-11T16:58:32+00:00

Is there a faster way to rotate a large bitmap by 90 or 270

0

Is there a faster way to rotate a large bitmap by 90 or 270 degrees than simply doing a nested loop with inverted coordinates?

The bitmaps are 8bpp and typically 2048x2400x8bpp

Currently I do this by simply copying with argument inversion, roughly (pseudo code:

for x = 0 to 2048-1
  for y = 0 to 2048-1
    dest[x][y]=src[y][x];

(In reality I do it with pointers, for a bit more speed, but that is roughly the same magnitude)

GDI is quite slow with large images, and GPU load/store times for textures (GF7 cards) are in the same magnitude as the current CPU time.

Any tips, pointers? An in-place algorithm would even be better, but speed is more important than being in-place.

Target is Delphi, but it is more an algorithmic question. SSE(2) vectorization no problem, it is a big enough problem for me to code it in assembler

Follow up to Nils’ answer

Image 2048×2700 -> 2700×2048
Compiler Turbo Explorer 2006 with optimization on.
Windows: Power scheme set to "Always on". (important!!!!)
Machine: Core2 6600 (2.4 GHz)

time with old routine: 32ms (step 1)

time with stepsize 8 : 12ms

time with stepsize 16 : 10ms

time with stepsize 32+ : 9ms

Meanwhile I also tested on a Athlon 64 X2 (5200+ iirc), and the speed up there was slightly more than a factor four (80 to 19 ms).

The speed up is well worth it, thanks. Maybe that during the summer months I’ll torture myself with a SSE(2) version. However I already thought about how to tackle that, and I think I’ll run out of SSE2 registers for an straight implementation:

for n:=0 to 7 do
  begin
    load r0, <source+n*rowsize> 
    shift byte from r0 into r1
    shift byte from r0 into r2
    ..
    shift byte from r0 into r8
  end; 
store r1, <target>   
store r2, <target+1*<rowsize>
..
store r8, <target+7*<rowsize>

So 8×8 needs 9 registers, but 32-bits SSE only has 8. Anyway that is something for the summer months 🙂

Note that the pointer thing is something that I do out of instinct, but it could be there is actually something to it, if your dimensions are not hardcoded, the compiler can’t turn the mul into a shift. While muls an sich are cheap nowadays, they also generate more register pressure afaik.

The code (validated by subtracting result from the "naieve" rotate1 implementation):

const stepsize = 32;
procedure rotatealign(Source: tbw8image; Target:tbw8image);

var stepsx,stepsy,restx,resty : Integer;
   RowPitchSource, RowPitchTarget : Integer;
   pSource, pTarget,ps1,ps2 : pchar;
   x,y,i,j: integer;
   rpstep : integer;
begin
  RowPitchSource := source.RowPitch;          // bytes to jump to next line. Can be negative (includes alignment)
  RowPitchTarget := target.RowPitch;        rpstep:=RowPitchTarget*stepsize;
  stepsx:=source.ImageWidth div stepsize;
  stepsy:=source.ImageHeight div stepsize;
  // check if mod 16=0 here for both dimensions, if so -> SSE2.
  for y := 0 to stepsy - 1 do
    begin
      psource:=source.GetImagePointer(0,y*stepsize);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(target.imagewidth-(y+1)*stepsize,0);
      for x := 0 to stepsx - 1 do
        begin
          for i := 0 to stepsize - 1 do
            begin
              ps1:=@psource[rowpitchsource*i];   // ( 0,i)
              ps2:=@ptarget[stepsize-1-i];       //  (maxx-i,0);
              for j := 0 to stepsize - 1 do
               begin
                 ps2[0]:=ps1[j];
                 inc(ps2,RowPitchTarget);
               end;
            end;
          inc(psource,stepsize);
          inc(ptarget,rpstep);
        end;
    end;
  // 3 more areas to do, with dimensions
  // - stepsy*stepsize * restx        // right most column of restx width
  // - stepsx*stepsize * resty        // bottom row with resty height
  // - restx*resty                    // bottom-right rectangle.
  restx:=source.ImageWidth mod stepsize;   // typically zero because width is 
                                          // typically 1024 or 2048
  resty:=source.Imageheight mod stepsize;
  if restx>0 then
    begin
      // one loop less, since we know this fits in one line of  "blocks"
      psource:=source.GetImagePointer(source.ImageWidth-restx,0);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(Target.imagewidth-stepsize,Target.imageheight-restx);
      for y := 0 to stepsy - 1 do
        begin
          for i := 0 to stepsize - 1 do
            begin
              ps1:=@psource[rowpitchsource*i];   // ( 0,i)
              ps2:=@ptarget[stepsize-1-i];       //  (maxx-i,0);
              for j := 0 to restx - 1 do
               begin
                 ps2[0]:=ps1[j];
                 inc(ps2,RowPitchTarget);
               end;
            end;
         inc(psource,stepsize*RowPitchSource);
         dec(ptarget,stepsize);
       end;
    end;
  if resty>0 then
    begin
      // one loop less, since we know this fits in one line of  "blocks"
      psource:=source.GetImagePointer(0,source.ImageHeight-resty);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(0,0);
      for x := 0 to stepsx - 1 do
        begin
          for i := 0 to resty- 1 do
            begin
              ps1:=@psource[rowpitchsource*i];   // ( 0,i)
              ps2:=@ptarget[resty-1-i];       //  (maxx-i,0);
              for j := 0 to stepsize - 1 do
               begin
                 ps2[0]:=ps1[j];
                 inc(ps2,RowPitchTarget);
               end;
            end;
         inc(psource,stepsize);
         inc(ptarget,rpstep);
       end;
    end;
 if (resty>0) and (restx>0) then
    begin
      // another loop less, since only one block
      psource:=source.GetImagePointer(source.ImageWidth-restx,source.ImageHeight-resty);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(0,target.ImageHeight-restx);
      for i := 0 to resty- 1 do
        begin
          ps1:=@psource[rowpitchsource*i];   // ( 0,i)
          ps2:=@ptarget[resty-1-i];       //  (maxx-i,0);
          for j := 0 to restx - 1 do
            begin
              ps2[0]:=ps1[j];
              inc(ps2,RowPitchTarget);
            end;
       end;
    end;
end;

Update 2 Generics

I tried to update this code to a generics version in Delphi XE. I failed because of QC 99703, and forum people have already confirmed it also exists in XE2. Please vote for it 🙂

Update 3 Generics
Works now in XE10

Update 4

In 2017 i did some work on a assembler version for 8×8 cubes of 8bpp images only and related SO question about shuffle bottlenecks where Peter Cordes generously helped me out. This code still has a missed oportunity and still needs another looptiling level again to aggregate multiple 8×8 block iterations into pseudo larger ones like 64×64. Now it is whole lines again and that is wasteful.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-11T16:58:33+00:00

Yes, there are faster ways to do this.

Your simple loop spends most of the time in cache misses. This happends because you touch a lot of data at very different places in a tight loop. Even worse: Your memory locations are exactly a power of two apart. That’s a size where the cache performs worst.

You can improve this rotation algorithm if you improve the locality of your memory accesses.

A simple way to do this would be to rotate each 8×8 pixel block on it’s own using the same code you’ve used for your whole bitmap, and wrap another loop that splits the image rotation into chunks of 8×8 pixels each.

E.g. something like this (not checked, and sorry for the C-code. My Delphi skills aren’t up to date):

 // this is the outer-loop that breaks your image rotation
 // into chunks of 8x8 pixels each:
 for (int block_x = 0; block_x < 2048; block_x+=8)
 {
    for (int block_y = 0; blocky_y < 2048; block_y+=8)
    { 
       // this is the inner-loop that processes a block
       // of 8x8 pixels.
       for (int x= 0; x<8; x++)
         for (int y=0; y<8; y++)
            dest[x+block_x][y+block_y] = src[y+block_y][x+block_x]
    }
 }

There are other ways as well. You could process the data in Hilbert-Order or Morton-Order. That would be in theory even a bit faster, but the code will be much more complex.

Btw – Since you’ve mentioned that SSE is an option for you. Note that you can rotate a 8×8 byte block within the SSE-registers. It’s a bit tricky to get it working, but looking at SSE matrix transpose code should get you started as it’s the same thing.

EDIT:

Just checked:

With a block-size of 8×8 pixels the code runs ca. 5 times faster on my machine. With a block-size of 16×16 it runs 10 times faster.

Seems like it’s a good idea to experiment with different block-sizes.

Here is the (very simple) test-program I’ve used:

#include <stdio.h>
#include <windows.h>

char temp1[2048*2048];
char temp2[2048*2048];

void rotate1 (void)
{
  int x,y;
  for (y=0; y<2048; y++)
  for (x=0; x<2048; x++)
    temp2[2048*y+x] = temp1[2048*x+y];
}

void rotate2 (void)
{
  int x,y;
  int bx, by;

  for (by=0; by<2048; by+=8)
  for (bx=0; bx<2048; bx+=8)
  for (y=0; y<8; y++)
  for (x=0; x<8; x++)
    temp2[2048*(y+by)+x+bx] = temp1[2048*(x+bx)+y+by];
}

void rotate3 (void)
{
  int x,y;
  int bx, by;

  for (by=0; by<2048; by+=16)
  for (bx=0; bx<2048; bx+=16)
  for (y=0; y<16; y++)
  for (x=0; x<16; x++)
    temp2[2048*(y+by)+x+bx] = temp1[2048*(x+bx)+y+by];
}


int main (int argc, char **args)
{
  int i, t1;

  t1 = GetTickCount();
  for (i=0; i<20; i++) rotate1();
  printf ("%d\n", GetTickCount()-t1);

  t1 = GetTickCount();
  for (i=0; i<20; i++) rotate2();
  printf ("%d\n", GetTickCount()-t1);

  t1 = GetTickCount();
  for (i=0; i<20; i++) rotate3();
  printf ("%d\n", GetTickCount()-t1);

}

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

Is there a faster way to rotate a large bitmap by 90 or 270

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply