Is there a faster way to rotate a large bitmap by 90 or 270 degrees than simply doing a nested loop with inverted coordinates?
The bitmaps are 8bpp and typically 2048*2400*8bpp
Currently I do this by simply copying with argument inversion, roughly (pseudo code:
for x = 0 to 2048-1
for y = 0 to 2048-1
dest[x][y]=src[y][x];
(In reality I do it with pointers, for a bit more speed, but that is roughly the same magnitude)
GDI is quite slow with large images, and GPU load/store times for textures (GF7 cards) are in the same magnitude as the current CPU time.
Any tips, pointers? An in-place algorithm would even be better, but speed is more important than being in-place.
Target is Delphi, but it is more an algorithmic question. SSE(2) vectorization no problem, it is a big enough problem for me to code it in assembler
Duplicates How do you rotate a two dimensional array?.
Follow up to Nils' answer
Image 2048x2700 - 2700x2048
Compiler Turbo Explorer 2006 with optimization on.
Windows: Power scheme set to "Always on". (important!!!!)
Machine: Core2 6600 (2.4 GHz)
time with old routine: 32ms (step 1)
time with stepsize 8 : 12ms
time with stepsize 16 : 10ms
time with stepsize 32+ : 9ms
Meanwhile I also tested on a Athlon 64 X2 (5200+ iirc), and the speed up there was slightly more than a factor four (80 to 19 ms).
The speed up is well worth it, thanks. Maybe that during the summer months I'll torture myself with a SSE(2) version. However I already thought about how to tackle that, and I think I'll run out of SSE2 registers for an straight implementation:
for n:=0 to 7 do
begin
load r0, <source+n*rowsize>
shift byte from r0 into r1
shift byte from r0 into r2
..
shift byte from r0 into r8
end;
store r1, <target>
store r2, <target+1*<rowsize>
..
store r8, <target+7*<rowsize>
So 8x8 needs 9 registers, but 32-bits SSE only has 8. Anyway that is something for the summer months :-)
Note that the pointer thing is something that I do out of instinct, but it could be there is actually something to it, if your dimensions are not hardcoded, the compiler can't turn the mul into a shift. While muls an sich are cheap nowadays, they also generate more register pressure afaik.
The code (validated by subtracting result from the "naieve" rotate1 implementation):
const stepsize = 32;
procedure rotatealign(Source: tbw8image; Target:tbw8image);
var stepsx,stepsy,restx,resty : Integer;
RowPitchSource, RowPitchTarget : Integer;
pSource, pTarget,ps1,ps2 : pchar;
x,y,i,j: integer;
rpstep : integer;
begin
RowPitchSource := source.RowPitch; // bytes to jump to next line. Can be negative (includes alignment)
RowPitchTarget := target.RowPitch; rpstep:=RowPitchTarget*stepsize;
stepsx:=source.ImageWidth div stepsize;
stepsy:=source.ImageHeight div stepsize;
// check if mod 16=0 here for both dimensions, if so -> SSE2.
for y := 0 to stepsy - 1 do
begin
psource:=source.GetImagePointer(0,y*stepsize); // gets pointer to pixel x,y
ptarget:=Target.GetImagePointer(target.imagewidth-(y+1)*stepsize,0);
for x := 0 to stepsx - 1 do
begin
for i := 0 to stepsize - 1 do
begin
ps1:=@psource[rowpitchsource*i]; // ( 0,i)
ps2:=@ptarget[stepsize-1-i]; // (maxx-i,0);
for j := 0 to stepsize - 1 do
begin
ps2[0]:=ps1[j];
inc(ps2,RowPitchTarget);
end;
end;
inc(psource,stepsize);
inc(ptarget,rpstep);
end;
end;
// 3 more areas to do, with dimensions
// - stepsy*stepsize * restx // right most column of restx width
// - stepsx*stepsize * resty // bottom row with resty height
// - restx*resty // bottom-right rectangle.
restx:=source.ImageWidth mod stepsize; // typically zero because width is
// typically 1024 or 2048
resty:=source.Imageheight mod stepsize;
if restx>0 then
begin
// one loop less, since we know this fits in one line of "blocks"
psource:=source.GetImagePointer(source.ImageWidth-restx,0); // gets pointer to pixel x,y
ptarget:=Target.GetImagePointer(Target.imagewidth-stepsize,Target.imageheight-restx);
for y := 0 to stepsy - 1 do
begin
for i := 0 to stepsize - 1 do
begin
ps1:=@psource[rowpitchsource*i]; // ( 0,i)
ps2:=@ptarget[stepsize-1-i]; // (maxx-i,0);
for j := 0 to restx - 1 do
begin
ps2[0]:=ps1[j];
inc(ps2,RowPitchTarget);
end;
end;
inc(psource,stepsize*RowPitchSource);
dec(ptarget,stepsize);
end;
end;
if resty>0 then
begin
// one loop less, since we know this fits in one line of "blocks"
psource:=source.GetImagePointer(0,source.ImageHeight-resty); // gets pointer to pixel x,y
ptarget:=Target.GetImagePointer(0,0);
for x := 0 to stepsx - 1 do
begin
for i := 0 to resty- 1 do
begin
ps1:=@psource[rowpitchsource*i]; // ( 0,i)
ps2:=@ptarget[resty-1-i]; // (maxx-i,0);
for j := 0 to stepsize - 1 do
begin
ps2[0]:=ps1[j];
inc(ps2,RowPitchTarget);
end;
end;
inc(psource,stepsize);
inc(ptarget,rpstep);
end;
end;
if (resty>0) and (restx>0) then
begin
// another loop less, since only one block
psource:=source.GetImagePointer(source.ImageWidth-restx,source.ImageHeight-resty); // gets pointer to pixel x,y
ptarget:=Target.GetImagePointer(0,target.ImageHeight-restx);
for i := 0 to resty- 1 do
begin
ps1:=@psource[rowpitchsource*i]; // ( 0,i)
ps2:=@ptarget[resty-1-i]; // (maxx-i,0);
for j := 0 to restx - 1 do
begin
ps2[0]:=ps1[j];
inc(ps2,RowPitchTarget);
end;
end;
end;
end;