Hello everyone:
I'm now trapped by a performance optimization lab in the book "Computer System from a Programmer's Perspective" described as following:
In a N*N matrix M, where N is multiple of 32, the rotate operation can be represented as:
     Transpose: interchange elements M(i,j) and M(j,i)
     Exchange rows: Row i is exchanged with row N-1-i
A example for matrix rotation(N is 3 instead of 32 for simplicity):
-------                          -------
|1|2|3|                          |3|6|9|
-------                          -------
|4|5|6|    after rotate is       |2|5|8|
-------                          -------
|7|8|9|                          |1|4|7|
-------                          -------
A naive implementation is:
#define RIDX(i,j,n) ((i)*(n)+(j))
    void naive_rotate(int dim, pixel *src, pixel *dst) 
    {
        int i, j;
        for (i = 0; i < dim; i++)
     for (j = 0; j < dim; j++)
         dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
    }
I come up with an idea by inner-loop-unroll. The result is:
Code Version          Speed Up
  original                1x
 unrolled by 2           1.33x
 unrolled by 4           1.33x
 unrolled by 8           1.55x
 unrolled by 16          1.67x
 unrolled by 32          1.61x
I also get a code snippet from pastebin.com that seems can solve this problem:
void rotate(int dim, pixel *src, pixel *dst) 
{
    int stride = 32;
    int count = dim >> 5;
    src += dim - 1; 
    int a1 = count;
    do {
        int a2 = dim;
        do {
            int a3 = stride;
            do {
                *dst++ = *src;
                src += dim;
            } while(--a3);
            src -= dim * stride + 1;
            dst += dim - stride;
        } while(--a2);
        src += dim * (stride + 1);
        dst -= dim * dim - stride;
    } while(--a1);
}
After carefully read the code, I think main idea of this solution is treat 32 rows as a data zone, and perform the rotating operation respectively. Speed up of this version is 1.85x, overwhelming all the loop-unroll version.
Here are the questions:
In the inner-loop-unroll version, why does increment slow down if the unrolling factor increase, especially change the unrolling factor from 8 to 16, which does not effect the same when switch from 4 to 8? Does the result have some relationship with depth of the CPU pipeline? If the answer is yes, could the degrade of increment reflect pipeline length?
What is the probable reason for the optimization of data-zone version? It seems that there is no too much essential difference from the original naive version.
EDIT:
My test environment is Intel Centrino Duo processor and the verion of gcc is 4.4
Any advice will be highly appreciated!
Kind regards!