Intel IA-32 Computer Accessories User Manual


 
IA-32 Instruction Latency and Throughput C
accurately predict realistic performance of actual code sequences based
on adding instruction latency data.
The instruction latency data are useful when tuning a dependency
chain. However, dependency chains limit the out-of-order core’s
ability to execute micro-ops in parallel. The instruction throughput
data are useful when tuning parallel code unencumbered by
dependency chains.
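The distinction between the two kinds of data can be illustrated with a first-order estimate. The sketch below is not a performance predictor; it only shows which figure dominates in each case. It assumes that throughput is expressed as the number of clocks between successive issues of the same instruction, and the function names and the latency and throughput values are hypothetical, chosen only to make the arithmetic visible.

```python
def dependent_chain_cycles(n_ops, latency):
    """Rough bound for n_ops instructions that each consume the previous result.

    Every instruction must wait for its predecessor, so the latencies add up.
    """
    return n_ops * latency


def independent_stream_cycles(n_ops, latency, throughput):
    """Rough bound for n_ops instructions with no dependencies between them.

    A new instruction can issue every `throughput` clocks, so only the
    latency of the final instruction is exposed.
    """
    if n_ops == 0:
        return 0
    return (n_ops - 1) * throughput + latency


# Hypothetical instruction: latency 4 clocks, throughput 1 clock.
n = 100
chain = dependent_chain_cycles(n, latency=4)
stream = independent_stream_cycles(n, latency=4, throughput=1)
print(chain, stream)  # 400 103
```

Under these assumptions the dependent chain is bounded by latency, while the independent stream is bounded almost entirely by throughput, which is why each kind of data suits a different tuning task.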
All numeric data in the tables are:
• approximate and subject to change in future
  implementations of the Intel NetBurst microarchitecture or the
  Pentium M processor microarchitecture.
• not meant to be used as reference numbers for comparisons of
  instruction-level performance benchmarks. Comparison of
  instruction-level performance of microprocessors that are based
  on different microarchitectures is a complex subject that requires
  additional information that is beyond the scope of this manual.
Comparisons of latency and throughput data between the Pentium 4
processor and the Pentium M processor can be misleading, because one
cycle in the Pentium 4 processor is NOT equal to one cycle in the
Pentium M processor. The Pentium 4 processor is designed to operate at
higher clock frequencies than the Pentium M processor.
Many IA-32 instructions can take either registers as their operands or a
combination of register and memory address as their operands. The
performance of a given instruction differs between these two forms.
The section that follows, “Latency and Throughput with Register
Operands”, gives the latency and throughput data for the
register-to-register instruction type. Section “Latency and Throughput
with Memory Operands” discusses how to adjust latency and
throughput specifications for the register-to-memory and
memory-to-register instructions.
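The caution above about comparing cycle counts between the Pentium 4 processor and the Pentium M processor comes down to simple arithmetic: the duration of a cycle depends on the clock frequency. The following sketch converts cycle counts to elapsed time. The cycle counts and clock frequencies are hypothetical and are not taken from the tables; the helper name `cycles_to_ns` is likewise an assumption made for illustration.

```python
def cycles_to_ns(cycles, clock_ghz):
    """Convert a cycle count to nanoseconds at the given core clock in GHz."""
    return cycles / clock_ghz


# Hypothetical figures, chosen only to illustrate the arithmetic.
fast_clock_ns = cycles_to_ns(cycles=4, clock_ghz=3.0)  # more cycles, faster clock
slow_clock_ns = cycles_to_ns(cycles=3, clock_ghz=1.5)  # fewer cycles, slower clock
print(f"{fast_clock_ns:.2f} ns vs {slow_clock_ns:.2f} ns")  # 1.33 ns vs 2.00 ns
```

In this made-up case the instruction with the larger cycle count still completes sooner in elapsed time, which is why raw cycle figures from the two processors should not be compared directly.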
In some cases, the latency or throughput figures given are just one half
of a clock. This occurs only for the double-speed ALUs.