A Detailed Look Inside the Intel
®
NetBurst
™
Micro-Architecture of the Intel Pentium
®
4 Processor
Page 14
§ selecting IA-32 instructions that can be decoded into less than 4 µops and/or have short latencies
§ ordering IA-32 instructions to preserve available parallelism by minimizing long dependence chains and
covering long instruction latencies
§ ordering instructions so that their operands are ready and their corresponding issue ports and execution units
are free when they reach the scheduler.
This subsection describes port restrictions, result latencies, and issue latencies (also referred to as throughput) that
form the basis for that ordering. Scheduling affects the way that instructions are presented to the core of the
processor, but it is the execution core that reacts to an ever-changing machine state, reordering instructions for faster
execution or delaying them because of dependence and resource constraints. Thus the ordering of instructions is
more of a suggestion to the hardware.
The Intel® Pentium® 4 Processor Optimization Reference Manual lists the IA-32 instructions with their latency,
their issue throughput, and in relevant cases, the associated execution units. Some execution units are not pipelined,
such that µops cannot be dispatched in consecutive cycles and the throughput is less than one per cycle.
The number of µops associated with each instruction provides a basis for selecting which instructions to generate. In
particular, µops which are executed out of the microcode ROM involve extra overhead. For the Pentium II and
Pentium III processors, optimizing the performance of the decoder, which includes paying attention to the 4-1-1
sequence (instruction with four µops followed by two instructions each with one µop) and taking into account the
number of µops for each IA-32 instruction, was very important. On the Pentium 4 processor, the decoder template is
not an issue. Therefore it is no longer necessary to use a detailed list of exact µop count for IA-32 instructions.
Commonly used IA-32 instructions, which consist of four or less µops, are provided in the Intel
®
Pentium
®
4
Processor Optimization Reference Manual to aid instruction selection.
Execution Units and Issue Ports
Each cycle, the core may dispatch µops to one or more of the four issue ports. At the micro-architectural level, store
operations are further divided into two parts: store data and store address operations. The four ports through which
µops are dispatched to various execution units and to perform load and store operations are shown in Figure 4. Some
ports can dispatch two µops per clock because the execution unit for that µop executes at twice the speed, and those
execution units are marked “Double speed.”
Port 0. In the first half of the cycle,
port 0 can dispatch either one
floating-point move µop (including
floating-point stack move, floating-
point exchange or floating-point
store data), or one arithmetic
logical unit (ALU) µop (including
arithmetic, logic or store data). In
the second half of the cycle, it can
dispatch one similar ALU µop.
Port 1. In the first half of the cycle,
port 1 can dispatch either one
floating-point execution (all
floating-point operations except
moves, all SIMD operations) µop
or normal-speed integer (multiply,
shift and rotate) µop, or one ALU
(arithmetic, logic or branch) µop.
In the second half of the cycle, it can dispatch one similar ALU µop.
Port 2. Port 2 supports the dispatch of one load operation per cycle.
Note:
FP_ADD refers to x87 FP, and SIMD FP add and subtract operations
FP_MUL refers to x87 FP, and SIMD FP multiply operations
FP_DIV refers to x87 FP, and SIMD FP divide and square-root operations
MMX_ALU refers to SIMD integer arithmetic and logic operations
MMX_SHFT handles Shift, Rotate, Shuffle, Pack and Unpack operations
MMX_MISC handles SIMD reciprocal and some integer operations
Figure 4 Execution Units and Ports of the Out-of-order Core
Memory
Store
Store
Address
Port 3
Memory
Load
All Loads
LEA
Prefetch
Port 2
ALU 1
Double
speed
ADD/SUB
Integer
Operation
Normal
speed
Shift/Rotate
Port 1
FP
Execute
FP_ADD
FP_MUL
FP_DIV
FP_MISC
MMX_SHFT
MMX_ALU
MMX_MISC
Port 0
FP Move
FP Move
FP Store Data
FXCH
ALU 0
Double
speed
ADD/SUB
Logic
Store Data
Branches