IA-32 Instruction Latency and Throughput C
Table Footnotes
The following footnotes refer to all tables in this appendix.
1. Latency figures for many complex instructions (> 4 μops) are
conservative, worst-case estimates. Actual performance of these
instructions on the out-of-order core execution unit can range from
somewhat faster to significantly faster than the nominal latency
data shown in these tables.
2. The names of execution units apply only to processor
implementations of the Intel NetBurst microarchitecture with a
CPUID signature of family 15 and model encoding 0, 1, or 2. They
include: ALU, FP_EXECUTE, FPMOVE, MEM_LOAD, MEM_STORE. See
Figure 1-4 for execution units and ports in the out-of-order core.
Note the following:
• The FP_EXECUTE unit is actually a cluster of execution units,
roughly consisting of seven separate execution units.
• The FP_ADD unit handles x87 and SIMD floating-point add and
subtract operations.
• The FP_MUL unit handles x87 and SIMD floating-point multiply
operations.
• The FP_DIV unit handles x87 and SIMD floating-point divide and
square-root operations.
• The MMX_SHFT unit handles shift and rotate operations.
• The MMX_ALU unit handles SIMD integer ALU operations.
• The MMX_MISC unit handles reciprocal MMX computations and
some integer operations.
• The FP_MISC designates other execution units in port 1 that are
separate from the six units listed above.
3. It may be possible to construct code sequences that issue some
IA-32 instructions repeatedly and achieve a latency that is one or
two clock cycles lower than the more realistic number listed in
these tables.
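
Footnote 2 limits the execution unit names to CPUID signatures of
family 15 with model encoding 0, 1, or 2. The sketch below shows one
way such a check might be written. It assumes the raw signature (the
EAX value returned by CPUID with EAX = 1) has already been read, and
it uses the standard bit layout of that value. The type and function
names (cpu_signature, decode_signature, unit_names_apply) are
hypothetical and are not taken from this manual.

```c
#include <stdbool.h>
#include <stdint.h>

/* Family and model decoded from a CPUID signature
 * (assumed to be the EAX value returned by CPUID leaf 1). */
typedef struct {
    unsigned family;
    unsigned model;
} cpu_signature;

/* Hypothetical helper: split a raw signature into family and model.
 * Bit layout: stepping [3:0], model [7:4], family [11:8],
 * extended model [19:16], extended family [27:20]. */
cpu_signature decode_signature(uint32_t eax)
{
    unsigned base_family = (eax >> 8) & 0xF;
    unsigned base_model  = (eax >> 4) & 0xF;
    unsigned ext_family  = (eax >> 20) & 0xFF;
    unsigned ext_model   = (eax >> 16) & 0xF;

    cpu_signature sig;

    /* The extended family is added only when the base family is 15. */
    sig.family = (base_family == 0xF) ? base_family + ext_family
                                      : base_family;

    /* The extended model supplies the upper model bits only for
     * base family 6 or 15. */
    sig.model = (base_family == 0x6 || base_family == 0xF)
                    ? ((ext_model << 4) | base_model)
                    : base_model;
    return sig;
}

/* Hypothetical helper: true when the signature falls in the range
 * footnote 2 names (family 15, model encoding 0, 1, or 2). */
bool unit_names_apply(uint32_t eax)
{
    cpu_signature sig = decode_signature(eax);
    return sig.family == 15 && sig.model <= 2;
}
```

For example, a signature of 0x00000F24 decodes to family 15, model 2,
so unit_names_apply returns true, while 0x00000F34 (family 15,
model 3) returns false.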