242 x87 Floating-Point Optimizations Chapter 10
25112 Rev. 3.06 September 2005
Software Optimization Guide for AMD64 Processors
Align and Pack DirectPath x87 Instructions
The last optimization to perform is code packing and alignment. Having an abundance of
operations in the decoders keeps the processor’s schedulers well fed even when
instructions cannot be supplied to the decoders immediately. Floating-point x87 code can be aligned
to 8-byte boundaries as illustrated here, which is optimal on AMD Athlon, AMD Athlon 64, and
AMD Opteron processors:
;Instruction Address   Opcode      Instruction
;==================================================
00000360               66          DB 066h
00000361               DD 06       fld QWORD PTR [esi]
00000363               66          DB 066h
00000364               DD 07       fld QWORD PTR [edi]
00000366               D8 C9       fmul st(0), st(1)
00000368               DE C7       faddp st(7), st(0)
0000036A               DD 04 38    fld QWORD PTR [edi+eax]
0000036D               66          DB 066h
0000036E               D8 C9       fmul st(0), st(1)
00000370               DE C6       faddp st(6), st(0)
00000372               DD 04 47    fld QWORD PTR [edi+eax*2]
00000375               66          DB 066h
00000376               D8 C9       fmul st(0), st(1)
00000378               DE C5       faddp st(5), st(0)
0000037A               DD 04 3B    fld QWORD PTR [edi+ebx]
0000037D               66          DB 066h
0000037E               D8 C9       fmul st(0), st(1)
00000380               DE C4       faddp st(4), st(0)
00000382               DD 04 87    fld QWORD PTR [edi+eax*4]
00000385               66          DB 066h
00000386               D8 C9       fmul st(0), st(1)
00000388               DE C3       faddp st(3), st(0)
0000038A               DC 0C 39    fmul QWORD PTR [edi+ecx]
0000038D               66          DB 066h
0000038E               DE C1       faddp st(1), st(0)
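The address column in the listing above can be recomputed from the encoded lengths alone. The following is a minimal Python sketch, not part of the original guide: the helper names (`pad_bytes`, `layout`, `crosses_boundary`) are hypothetical, and the byte lengths are transcribed by hand from the opcode column. It checks that no entry straddles an 8-byte boundary and shows the padding arithmetic for a 7-byte group.

```python
ALIGN = 8        # alignment the text targets, in bytes
START = 0x360    # address of the first byte in the listing

# (label, encoded length in bytes), transcribed from the opcode column.
# The labels are for display only; they are not assembled.
LISTING = [
    ("db 66h", 1), ("fld [esi]", 2),
    ("db 66h", 1), ("fld [edi]", 2),
    ("fmul st(0),st(1)", 2), ("faddp st(7),st(0)", 2),
    ("fld [edi+eax]", 3), ("db 66h", 1),
    ("fmul st(0),st(1)", 2), ("faddp st(6),st(0)", 2),
    ("fld [edi+eax*2]", 3), ("db 66h", 1),
    ("fmul st(0),st(1)", 2), ("faddp st(5),st(0)", 2),
    ("fld [edi+ebx]", 3), ("db 66h", 1),
    ("fmul st(0),st(1)", 2), ("faddp st(4),st(0)", 2),
    ("fld [edi+eax*4]", 3), ("db 66h", 1),
    ("fmul st(0),st(1)", 2), ("faddp st(3),st(0)", 2),
    ("fmul [edi+ecx]", 3), ("db 66h", 1),
    ("faddp st(1),st(0)", 2),
]


def pad_bytes(length, align=ALIGN):
    """Bytes of padding needed to round `length` up to a multiple of `align`."""
    return -length % align


def layout(start, lengths):
    """Return each item's start address and the address just past the last item."""
    addrs = []
    addr = start
    for n in lengths:
        addrs.append(addr)
        addr += n
    return addrs, addr


def crosses_boundary(addr, length, align=ALIGN):
    """True if an item of `length` bytes at `addr` spans an `align`-byte boundary."""
    return addr // align != (addr + length - 1) // align


addrs, end = layout(START, [n for _, n in LISTING])

for addr, (label, n) in zip(addrs, LISTING):
    # Mark entries that begin exactly on an 8-byte boundary.
    mark = "*" if addr % ALIGN == 0 else " "
    print(f"{addr:08X}{mark} {n} byte(s)  {label}")

print("7-byte group needs", pad_bytes(7), "byte of padding")
```

Entries marked with `*` begin on an 8-byte boundary. The check treats each 66h prefix as its own entry, which matches how the listing presents it.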
The instruction address specifies the address (in hexadecimal) of the instruction to the right.
Typically three DirectPath instructions occupy 7 bytes. Maintaining 8-byte alignment for the next
group of three instructions requires the addition of a single byte. A 1-byte padding can easily be
achieved using the single-byte NOP instruction (opcode 90h), as recommended in “Code Padding
with Operand-Size Override and NOP” on page 89. However, for the special case of x87 instructions,