Appendix C: IA-32 Instruction Latency and Throughput
This appendix contains tables of the latency, throughput, and execution
units associated with the more commonly used IA-32 instructions.¹
Instruction timing data varies within the IA-32 family of processors.
Only data specific to the Intel Pentium 4, Intel Xeon, and Intel
Pentium M processors is provided. The relevance of instruction
throughput and latency information to code tuning is discussed in
Chapter 1 and Chapter 2; see “Execution Core Detail” in Chapter 1 and
“Floating Point/SIMD Operands” in Chapter 2.
This appendix contains the following sections:
• “Overview” – an overview of issues related to instruction selection
and scheduling.
• “Definitions” – the definitions for the primary information
presented in the tables in section “Latency and Throughput.”
• “Latency and Throughput of Pentium 4 and Intel Xeon processors”
– the listings of IA-32 instruction throughput, latency, and execution
units associated with commonly used instructions.
1. Although instruction latency may be useful in some limited situations (e.g., a tight loop
with a dependency chain that exposes instruction latency), software optimization on a
super-scalar, out-of-order microarchitecture will, in general, benefit much more from
increasing the effective throughput of the larger-scale code path. Coding techniques that
rely on instruction latency alone to influence instruction scheduling are likely to be
sub-optimal, because such techniques are likely to interfere with the out-of-order machine
or restrict the amount of instruction-level parallelism.
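
The distinction the footnote draws between a latency-exposing dependency chain and throughput-oriented code can be illustrated with a small C sketch. It is not taken from this manual: the function names and the choice of four accumulators are assumptions made for illustration. The first function forms one serial chain of additions, so each add must wait for the previous result. The second splits the work into four independent chains that an out-of-order core may overlap. Which version runs faster, and by how much, depends on the processor and the compiler. Because floating-point addition is not associative, the two versions can also differ in rounding for general inputs.

```c
#include <stddef.h>

/* Hypothetical helper: a single accumulator. Every addition depends on
   the result of the previous one, so the loop forms one serial
   dependency chain and exposes the latency of the add. */
double sum_single_chain(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Hypothetical helper: four independent accumulators (the count is an
   assumption, not a tuned value). The four chains do not depend on one
   another, which leaves more instruction-level parallelism available
   to the hardware. */
double sum_four_chains(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    /* Main loop: consume four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    /* Tail: handle the remaining zero to three elements. */
    for (; i < n; i++)
        s0 += a[i];

    /* Combine the partial sums. */
    return (s0 + s1) + (s2 + s3);
}
```

Both functions take a pointer to the data and an element count, for example sum_four_chains(values, count). Measuring both on the target processor is a more reliable guide than reasoning from latency figures alone.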