I'm seeing a 15% performance degradation in the same C++ code, compiled to exactly the same machine instructions but located at differently aligned addresses. When my tiny main loop starts at 0x415220 it's faster than when it starts at 0x415250. I'm running this on an Intel Core2 Duo, using gcc 4.4.5 on x86_64 Ubuntu.
Can anybody explain the cause of the slowdown, and how I can force gcc to align the loop optimally?
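Here is the disassembly for both cases with profiler annotations. The faster placement (loop at 0x415220) is listed first, followed by the slower one (loop at 0x415250):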
415220 576 12.56% |XXXXXXXXXXXXXX 48 c1 eb 08 shr $0x8,%rbx
415224 110 2.40% |XX 0f b6 c3 movzbl %bl,%eax
415227 0.00% | 41 0f b6 04 00 movzbl (%r8,%rax,1),%eax
41522c 40 0.87% | 48 8b 04 c1 mov (%rcx,%rax,8),%rax
415230 806 17.58% |XXXXXXXXXXXXXXXXXXX 4c 63 f8 movslq %eax,%r15
415233 186 4.06% |XXXX 48 c1 e8 20 shr $0x20,%rax
415237 102 2.22% |XX 4c 01 f9 add %r15,%rcx
41523a 414 9.03% |XXXXXXXXXX a8 0f test $0xf,%al
41523c 680 14.83% |XXXXXXXXXXXXXXXX 74 45 je 415283 ::Run(char const*, char const*)+0x4b3
41523e 0.00% | 41 89 c7 mov %eax,%r15d
415241 0.00% | 41 83 e7 01 and $0x1,%r15d
415245 0.00% | 41 83 ff 01 cmp $0x1,%r15d
415249 0.00% | 41 89 c7 mov %eax,%r15d
Slower placement (loop at 0x415250):
415250 679 13.05% |XXXXXXXXXXXXXXXX 48 c1 eb 08 shr $0x8,%rbx
415254 124 2.38% |XX 0f b6 c3 movzbl %bl,%eax
415257 0.00% | 41 0f b6 04 00 movzbl (%r8,%rax,1),%eax
41525c 43 0.83% |X 48 8b 04 c1 mov (%rcx,%rax,8),%rax
415260 828 15.91% |XXXXXXXXXXXXXXXXXXX 4c 63 f8 movslq %eax,%r15
415263 388 7.46% |XXXXXXXXX 48 c1 e8 20 shr $0x20,%rax
415267 141 2.71% |XXX 4c 01 f9 add %r15,%rcx
41526a 634 12.18% |XXXXXXXXXXXXXXX a8 0f test $0xf,%al
41526c 749 14.39% |XXXXXXXXXXXXXXXXXX 74 45 je 4152b3 ::Run(char const*, char const*)+0x4c3
41526e 0.00% | 41 89 c7 mov %eax,%r15d
415271 0.00% | 41 83 e7 01 and $0x1,%r15d
415275 0.00% | 41 83 ff 01 cmp $0x1,%r15d
415279 0.00% | 41 89 c7 mov %eax,%r15d