Appendix A Microarchitecture for AMD Athlon™ 64 and AMD Opteron™ Processors 257
Software Optimization Guide for AMD64 Processors
25112 Rev. 3.06 September 2005
can execute out-of-order. In addition, a particular integer pipe can execute two micro-ops from
different macro-ops (one in the ALU and one in the AGU) at the same time. See Figure 7 on
page 256.
Each of the three ALUs performs general purpose logic functions, arithmetic functions, conditional
functions, divide step functions, status flag multiplexing, and branch resolutions. The AGUs calculate
the logical addresses for loads, stores, and LEAs. The load-store unit reads and writes data to and
from the L1 data cache. The integer scheduler sends completion status to the ICU when all
outstanding micro-ops for a given macro-op have been executed.
All integer operations except multiplies can be handled by any of the three ALUs. Multiplies are
handled by a pipelined multiplier attached to the pipeline at pipe 0, as shown in Figure 7. Multiplies
always issue to integer pipe 0, and the issue logic creates result-bus bubbles for the multiplier in
integer pipes 0 and 1 by preventing non-multiply micro-ops from issuing at the appropriate time.
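As a concrete illustration, an integer multiply always issues to pipe 0's pipelined multiplier, while independent ALU micro-ops remain free to issue to the other pipes in the same cycle. A sketch (register choices are illustrative, not from the guide):

```
imul  eax, ebx    ; multiply: always issues to integer pipe 0 (pipelined multiplier)
add   ecx, edx    ; independent ALU op: may issue concurrently to pipe 1 or 2
or    esi, edi    ; independent ALU op: may issue concurrently to pipe 1 or 2
```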
A.13 Floating-Point Scheduler
The floating-point logic of the AMD Athlon 64 and AMD Opteron processors is a high-performance,
fully pipelined, superscalar, out-of-order execution unit. It is capable of accepting three macro-ops
per cycle from any mixture of the following types of instructions:
• x87 floating-point
• 3DNow!™ technology
• MMX™ technology
• SSE
• SSE2
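Because the scheduler accepts any mixture of these types, macro-ops from different instruction-set extensions can be in flight together. A sketch of such a mix (register choices are illustrative):

```
fadd  st(0), st(1)   ; x87 floating-point macro-op
paddd mm0, mm1       ; MMX technology macro-op
addps xmm0, xmm1     ; SSE macro-op
```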
The floating-point scheduler handles register renaming and has a dedicated 36-entry scheduler buffer
organized as 12 lines of three macro-ops each. It also performs data superforwarding, micro-op issue,
and out-of-order execution. The floating-point scheduler communicates with the ICU to retire a
macro-op, to manage comparison results from the FCOMI instruction, and to back out results from a
branch misprediction.
Superforwarding is a performance optimization. It allows a floating-point operation that depends on a
register to be scheduled sooner when that register is waiting to be filled by a pure load from memory.
Instead of waiting for the first instruction to write its load data to the register and then waiting for
the second instruction to read it, the load data can be provided directly to the dependent
instruction, much like regular forwarding between FPU-only operations. The result from
the load is said to be "superforwarded" to the floating-point operation. In the following example, the
FADD can be scheduled to execute as soon as the load operation fetches its data rather than having to
wait and read it out of the register file.
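A minimal x87 sequence consistent with this description (the memory operand name is illustrative, not from the guide):

```
fld   qword ptr [x]   ; pure load from memory; pushes the value onto the x87 stack
fadd  st(0), st(1)    ; dependent FADD: the load data is superforwarded to it,
                      ; so it can execute as soon as the load returns, without
                      ; waiting for a register-file write and read
```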