AMD 250 Computer Hardware User Manual


240 x87 Floating-Point Optimizations Chapter 10

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

first element of rows 0–5 of A can be addressed as follows:

fld QWORD PTR [edi-128]         ; Load A[i,0].
fld QWORD PTR [edi+eax-128]     ; Load A[i+1,0].
fld QWORD PTR [edi+eax*2-128]   ; Load A[i+2,0].
fld QWORD PTR [edi+ebx-128]     ; Load A[i+3,0].
fld QWORD PTR [edi+eax*4-128]   ; Load A[i+4,0].
fld QWORD PTR [edi+ecx-128]     ; Load A[i+5,0].

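This addressing scheme keeps every load from memory short: 3 bytes when no index register is used,
and 4 bytes when one is (the index requires a SIB byte), because each displacement fits in a single
signed byte. Additionally, to address rows 6–11 of A, you only need to add 240*6 to EDI.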
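The address arithmetic behind the six loads above can be checked with a small C sketch. It is only an illustration, not code from the guide: the helper names are hypothetical, and the register contents (EAX = 240, EBX = 3*240, ECX = 5*240) and the 30-element row length are inferred from the listing and from the 240-byte row stride, not stated on this page. The sketch shows that a base pointer biased by 128 bytes lets every column of a 240-byte row be reached with a displacement in the signed 8-bit range.

```c
#include <assert.h>
#include <stddef.h>

#define N 30                                        /* elements per row (inferred) */
#define ROW_BYTES ((ptrdiff_t)(N * sizeof(double))) /* 240 bytes per row */
#define BIAS 128                                    /* bytes added to the row base */

/* Displacement, relative to the biased base, that reaches column k. */
int displacement_for_column(int k) {
    assert(k >= 0 && k < N);
    return k * (int)sizeof(double) - BIAS;
}

/* Index term for row i+r (r = 0..5), mirroring the six loads:
   none, eax, eax*2, ebx, eax*4, ecx. */
ptrdiff_t index_for_row(int r) {
    const ptrdiff_t eax = ROW_BYTES;      /* one row    (inferred) */
    const ptrdiff_t ebx = 3 * ROW_BYTES;  /* three rows (inferred) */
    const ptrdiff_t ecx = 5 * ROW_BYTES;  /* five rows  (inferred) */
    assert(r >= 0 && r < 6);
    switch (r) {
    case 0:  return 0;
    case 1:  return eax;
    case 2:  return eax * 2;
    case 3:  return ebx;
    case 4:  return eax * 4;
    default: return ecx;
    }
}

/* Load A[i+r][k]; a points at A[0][0] of a row-major matrix with N columns. */
double load_biased(const double *a, int i, int r, int k) {
    const char *edi = (const char *)(a + (ptrdiff_t)i * N) + BIAS; /* biased base */
    const char *p = edi + index_for_row(r) + displacement_for_column(k);
    return *(const double *)p;
}
```

With these assumptions the displacements run from -128 (column 0) to 104 (column 29), so all of them fit in one signed byte, and moving from rows 0–5 to rows 6–11 only requires passing i + 6, which corresponds to adding 240*6 to the base.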

Avoid Register Dependencies by Spacing Apart Instructions that Accumulate Results in a Register

The second general optimization to consider is spacing out register dependencies. Operations inside
the floating-point unit have an execution latency (normally 3–4 cycles for x87 operations), so an
instruction that consumes a result cannot start until the instruction producing it has completed.
Consider this instruction sequence:

fldz                          ; Push 0.0 onto floating-point stack.
fld   QWORD PTR [edi-128]     ; Push A[i,0] onto stack.
fmul  QWORD PTR [esi-128]     ; Multiply A[i,0] by B[0,j].
faddp st(1), st(0)            ; Accumulate contribution to dot product of
                              ; A’s row i and B’s column j.
fld   QWORD PTR [edi-120]     ; Push A[i,1] onto stack.
fmul  QWORD PTR [esi-120]     ; Multiply A[i,1] by B[1,j].
faddp st(1), st(0)            ; Accumulate contribution to dot product of
                              ; A’s row i and B’s column j.
fld   QWORD PTR [edi-112]     ; Push A[i,2] onto stack.
fmul  QWORD PTR [esi-112]     ; Multiply A[i,2] by B[2,j].
faddp st(1), st(0)            ; Accumulate contribution to dot product of
                              ; A’s row i and B’s column j.

The second instruction loads A[i,0] into ST(0), and the third multiplies it by B[0,j]. The fourth
adds this product to ST(1), where the dot product of row i of matrix A and column j of matrix B is
accumulated. Each subsequent group of three instructions adds the contribution of one of the
remaining 29 elements to the dot product. This code is poor because all the addition operations
depend on the contents of a single register, ST(1). The AMD Athlon, AMD Athlon 64, and AMD Opteron
processors have out-of-order-execution floating-point units, but none of the addition operations can
be performed out of order because the result of each addition depends on the outcome of the previous
one. Instruction scheduling based on this code greatly limits the throughput of the floating-point
unit. To alleviate this, space out operations that depend on one another. In this case, work with
six rows of A at a time rather than one, as follows:

; Multiply first element of each of six rows of A by first element of
; B’s column j.
fldz                          ; Push 0.0 six times onto floating-point stack.
fldz
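The listing is cut off at the page break, so the rest of the six-row code is not shown here. The idea it describes can still be sketched in C. This is an illustration under stated assumptions, not the guide's code: the function names are hypothetical, the 30-element length is taken from the "remaining 29 elements" remark, and B's column j is assumed to be stored contiguously, as the 8-byte steps in the [esi-128], [esi-120], [esi-112] operands suggest. One function keeps a single accumulator, so every addition waits on the previous one. The other keeps six accumulators, one per row of A, so consecutive additions are independent.

```c
#include <assert.h>
#include <stddef.h>

#define N 30      /* elements per row and per column (assumed from the text) */
#define ROWS 6    /* rows of A processed together */

/* Single accumulator: each addition depends on the result of the previous one. */
double dot_serial(const double *a_row, const double *b_col) {
    double acc = 0.0;
    for (size_t k = 0; k < N; ++k)
        acc += a_row[k] * b_col[k];
    return acc;
}

/* Six accumulators: for a given k, the six multiply-adds update different
   accumulators, so none of them has to wait for its neighbour.
   a points at the first of six consecutive rows of a row-major matrix. */
void dot_six_rows(const double *a, const double *b_col, double out[ROWS]) {
    double acc[ROWS] = {0.0};
    assert(a != NULL && b_col != NULL && out != NULL);
    for (size_t k = 0; k < N; ++k)
        for (size_t r = 0; r < ROWS; ++r)
            acc[r] += a[r * N + k] * b_col[k];
    for (size_t r = 0; r < ROWS; ++r)
        out[r] = acc[r];
}
```

Each row's products are still summed in the same order as in the single-accumulator version, so the restructuring changes only how far apart dependent additions sit, not which operations are performed. Whether a compiler keeps the six accumulators in registers is outside the scope of this sketch.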
