How does loop address alignment affect the speed on Intel x86_64?

Posted by Alexander Gololobov on Stack Overflow See other posts from Stack Overflow or by Alexander Gololobov
Published on 2010-12-25T22:40:25Z Indexed on 2010/12/25 22:54 UTC
Read the original article Hit count: 291

Filed under:
|
|
|
|

I'm seeing 15% performance degradation of the same C++ code compiled to exactly same machine instructions but located on differently aligned addresses. When my tiny main loop starts at 0x415220 it's faster then when it is at 0x415250. I'm running this on Intel Core2 Duo. I use gcc 4.4.5 on x86_64 Ubuntu.

Can anybody explain the cause of slowdown and how I can force gcc to optimally align the loop?

Here is the disassembly for both cases with profiler annotation:

  415220 576      12.56% |XXXXXXXXXXXXXX       48 c1 eb 08           shr    $0x8,%rbx
  415224 110       2.40% |XX                   0f b6 c3              movzbl %bl,%eax
  415227           0.00% |                     41 0f b6 04 00        movzbl (%r8,%rax,1),%eax
  41522c 40        0.87% |                     48 8b 04 c1           mov    (%rcx,%rax,8),%rax
  415230 806      17.58% |XXXXXXXXXXXXXXXXXXX  4c 63 f8              movslq %eax,%r15
  415233 186       4.06% |XXXX                 48 c1 e8 20           shr    $0x20,%rax
  415237 102       2.22% |XX                   4c 01 f9              add    %r15,%rcx
  41523a 414       9.03% |XXXXXXXXXX           a8 0f                 test   $0xf,%al
  41523c 680      14.83% |XXXXXXXXXXXXXXXX     74 45                 je     415283 ::Run(char const*, char const*)+0x4b3>
  41523e           0.00% |                     41 89 c7              mov    %eax,%r15d
  415241           0.00% |                     41 83 e7 01           and    $0x1,%r15d
  415245           0.00% |                     41 83 ff 01           cmp    $0x1,%r15d
  415249           0.00% |                     41 89 c7              mov    %eax,%r15d
  415250 679      13.05% |XXXXXXXXXXXXXXXX     48 c1 eb 08           shr    $0x8,%rbx
  415254 124       2.38% |XX                   0f b6 c3              movzbl %bl,%eax
  415257           0.00% |                     41 0f b6 04 00        movzbl (%r8,%rax,1),%eax
  41525c 43        0.83% |X                    48 8b 04 c1           mov    (%rcx,%rax,8),%rax
  415260 828      15.91% |XXXXXXXXXXXXXXXXXXX  4c 63 f8              movslq %eax,%r15
  415263 388       7.46% |XXXXXXXXX            48 c1 e8 20           shr    $0x20,%rax
  415267 141       2.71% |XXX                  4c 01 f9              add    %r15,%rcx
  41526a 634      12.18% |XXXXXXXXXXXXXXX      a8 0f                 test   $0xf,%al
  41526c 749      14.39% |XXXXXXXXXXXXXXXXXX   74 45                 je     4152b3 ::Run(char const*, char const*)+0x4c3>
  41526e           0.00% |                     41 89 c7              mov    %eax,%r15d
  415271           0.00% |                     41 83 e7 01           and    $0x1,%r15d
  415275           0.00% |                     41 83 ff 01           cmp    $0x1,%r15d
  415279           0.00% |                     41 89 c7              mov    %eax,%r15d

© Stack Overflow or respective owner

Related posts about c++

Related posts about optimization