Chapter 10 x87 Floating-Point Optimizations 241
Software Optimization Guide for AMD64 Processors
25112 Rev. 3.06 September 2005
fldz
fldz
fldz
fldz
fld   QWORD PTR [esi-128]        ; Push B[0,j] onto stack.
fld   QWORD PTR [edi-128]        ; Push A[i,0] onto stack.
fmul  st(0), st(1)               ; Multiply A[i,0] by B[0,j].
faddp st(7), st(0)               ; Accumulate contribution to dot product of
                                 ; A's row i and B's column j.
fld   QWORD PTR [edi+eax-128]    ; Push A[i+1,0] onto stack.
fmul  st(0), st(1)               ; Multiply A[i+1,0] by B[0,j].
faddp st(6), st(0)               ; Accumulate contribution to dot product of
                                 ; A's row i+1 and B's column j.
fld   QWORD PTR [edi+eax*2-128]  ; Push A[i+2,0] onto stack.
fmul  st(0), st(1)               ; Multiply A[i+2,0] by B[0,j].
faddp st(5), st(0)               ; Accumulate contribution to dot product of
                                 ; A's row i+2 and B's column j.
fld   QWORD PTR [edi+ebx-128]    ; Push A[i+3,0] onto stack.
fmul  st(0), st(1)               ; Multiply A[i+3,0] by B[0,j].
faddp st(4), st(0)               ; Accumulate contribution to dot product of
                                 ; A's row i+3 and B's column j.
fld   QWORD PTR [edi+eax*4-128]  ; Push A[i+4,0] onto stack.
fmul  st(0), st(1)               ; Multiply A[i+4,0] by B[0,j].
faddp st(3), st(0)               ; Accumulate contribution to dot product of
                                 ; A's row i+4 and B's column j.
fmul  QWORD PTR [edi+ecx-128]    ; Multiply A[i+5,0] by B[0,j].
faddp st(1), st(0)               ; Accumulate contribution to dot product of
                                 ; A's row i+5 and B's column j.
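For readers who find the register-stack bookkeeping hard to follow, the step performed by the assembly sequence above can be restated in C. This is only a minimal sketch and is not part of the original guide: the function name accumulate_step, the row_stride parameter (counted in elements, whereas the assembly appears to keep a byte stride in eax), and the six-element acc array standing in for the six x87 stack accumulators are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 6 /* six rows of A are processed against one column of B */

/* Hypothetical helper: one k-step of six simultaneous dot products.
 * a_col points at A[i,k]; row_stride is the distance between
 * consecutive rows of A, in elements (an assumption for this sketch).
 * b_elem is B[k,j]; acc[r] holds the running dot product for row i+r. */
static void accumulate_step(const double *a_col, size_t row_stride,
                            double b_elem, double acc[ROWS])
{
    assert(a_col != NULL && acc != NULL);

    /* b_elem is read once and reused for all six products, mirroring
     * the single FLD of B[0,j] followed by six multiplies. */
    for (size_t r = 0; r < ROWS; ++r)
        acc[r] += a_col[r * row_stride] * b_elem;
}
```

Each iteration of the loop touches a different accumulator, which is the property that lets the six multiply-add pairs overlap in the hardware; a compiler is not guaranteed to produce the x87 sequence shown above from this source.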
The processor can execute the instructions in this code sequence out of order because the six
multiply-accumulate chains, one per row of A, are independent of one another. Even though the loads
and multiplies appear sequentially in program order, the floating-point scheduler can execute the FLD
and FMUL instructions out of order, along with the FADDP instructions, so as to keep the multiplier
and adder pipes of the floating-point unit busy. B[0,j] is initially loaded into an x87 register and
multiplied by the loaded elements of each row with the reg, reg form of FMUL to minimize the number
of load operations that need to be performed. Additionally,
the first element from the sixth row of A, A[i+5,0], is not loaded but simply multiplied from memory
by the loaded element B[0,j]. This eliminates an FLD instruction and decreases the number of instructions
in the instruction cache and the workload on the processor’s decoder. To achieve two floating-point
operations per clock cycle, the number of floating-point operations should be twice the number of
load-store operations. In the example above, there are 12 floating-point operations (six multiplies
and six adds) and seven operations requiring loads from memory (six FLD instructions and one FMUL
with a memory operand), a ratio of about 1.7, so nearly two floating-point operations can be performed
per clock cycle.
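The ratio quoted above can be checked with a few lines of C. The sketch below is illustrative only: the helper name fp_ops_per_load is hypothetical, and it assumes the pattern of the example, in which each row contributes one multiply and one add and the only loads are one element of A per row plus the single shared element of B.

```c
#include <assert.h>

/* Hypothetical helper: floating-point operations per memory load for
 * one step that updates `rows` dot products from one element of B. */
static double fp_ops_per_load(unsigned rows)
{
    assert(rows > 0);

    unsigned fp_ops = 2u * rows; /* one multiply and one add per row */
    unsigned loads  = rows + 1u; /* one element of A per row, plus B */

    return (double)fp_ops / (double)loads;
}
```

For the six rows used in the example, fp_ops_per_load(6) evaluates 12/7, roughly 1.71, which matches the "nearly two" figure in the text.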