Chapter 10 x87 Floating-Point Optimizations 245

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

10.4 Using the FXCH Instruction Rather Than FST/FLD Pairs

Optimization

Increase parallelism by breaking up dependency chains, or by evaluating multiple dependency chains simultaneously by explicitly switching execution between them.
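As a source-level illustration of what breaking up a dependency chain means, the following is a minimal sketch in C, not code from this guide. The function names sum_one_chain and sum_two_chains are hypothetical. Whether the compiler emits x87 instructions for it depends on the target and compiler options, and splitting the sum reorders the additions, so rounding can differ from the single-chain version.

```c
#include <assert.h>
#include <stddef.h>

/* One dependency chain: each add must wait for the previous add. */
double sum_one_chain(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}

/* Two independent chains: s0 and s1 do not depend on each other,
 * so their adds can overlap in the floating-point pipeline. */
double sum_two_chains(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += x[i];      /* chain 0 */
        s1 += x[i + 1];  /* chain 1 */
    }
    if (i < n)           /* leftover element when n is odd */
        s0 += x[i];
    return s0 + s1;      /* join the chains once at the end */
}
```

Both functions return the same value whenever every partial sum is exactly representable; in general the two-chain result may differ in the last bits.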

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

Although the floating-point unit of the AMD Athlon 64 and AMD Opteron processors has a deep scheduler, which in most cases can extract sufficient parallelism from existing code, long dependency chains can stall the scheduler while issue slots are still available. The maximum dependency chain length that the scheduler can absorb is about six four-cycle instructions.

To switch execution between dependency chains, use the FXCH instruction, which has an apparent latency of zero cycles and generates only one micro-op. The floating-point unit of the AMD Athlon 64 and AMD Opteron processors contains special hardware to handle up to three FXCH instructions per cycle. Using FXCH is preferred over using FST/FLD pairs, even if the FST/FLD pair works on a register. An FST/FLD pair adds two cycles of latency and consists of two macro-ops.
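The switching pattern described above can be sketched as follows. This is an illustrative example under stated assumptions, not code from this guide: it uses GCC/Clang extended inline assembly in AT&T syntax, the function name two_chain_products is hypothetical, and a plain C fallback is compiled on non-x86 targets. Two independent products are kept in ST(0) and ST(1), and FXCH brings the other chain to the top of the stack between multiplies.

```c
#include <assert.h>
#include <stddef.h>

/* Computes a[0]*a[1]*a[2] and b[0]*b[1]*b[2] as two interleaved chains.
 * Hypothetical helper for illustration only. */
void two_chain_products(const double *a, const double *b,
                        double *out_a, double *out_b)
{
    assert(a != NULL && b != NULL && out_a != NULL && out_b != NULL);
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ volatile(
        "fldl  %[a0]\n\t"     /* ST(0) = A                      */
        "fldl  %[b0]\n\t"     /* ST(0) = B, ST(1) = A           */
        "fmull %[b1]\n\t"     /* B *= b[1]                      */
        "fxch  %%st(1)\n\t"   /* switch: ST(0) = A, ST(1) = B   */
        "fmull %[a1]\n\t"     /* A *= a[1]                      */
        "fxch  %%st(1)\n\t"   /* switch: ST(0) = B, ST(1) = A   */
        "fmull %[b2]\n\t"     /* B *= b[2]                      */
        "fxch  %%st(1)\n\t"   /* switch: ST(0) = A, ST(1) = B   */
        "fmull %[a2]\n\t"     /* A *= a[2]                      */
        "fstpl %[oa]\n\t"     /* store and pop A, ST(0) = B     */
        "fstpl %[ob]\n\t"     /* store and pop B, stack balanced */
        : [oa] "=m"(*out_a), [ob] "=m"(*out_b)
        : [a0] "m"(a[0]), [a1] "m"(a[1]), [a2] "m"(a[2]),
          [b0] "m"(b[0]), [b1] "m"(b[1]), [b2] "m"(b[2])
        : "st", "st(1)");     /* two x87 stack slots are used   */
#else
    /* Portable fallback with the same dependency structure. */
    double pa = a[0], pb = b[0];
    pb *= b[1];
    pa *= a[1];
    pb *= b[2];
    pa *= a[2];
    *out_a = pa;
    *out_b = pb;
#endif
}
```

For example, calling it with a = {1.5, 2.0, 4.0} and b = {3.0, 0.5, 10.0} stores 12.0 and 15.0 in the two outputs. Each FXCH only renames which chain is on top of the stack, so no value has to be stored and reloaded to switch chains.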
