IA-32 Instruction Latency and Throughput C
For the sake of simplicity, all requested data is assumed to reside in
the first-level data cache (cache hit). In general, IA-32 instructions
with load operations that execute in the integer ALUs require two more
clock cycles of latency than the corresponding register-to-register form
of the same instruction. The throughput of these instructions with a load
operation remains the same as that of the register-to-register form.
Floating-point, MMX technology, Streaming SIMD Extensions and
Streaming SIMD Extensions 2 instructions with load operations require six
more clock cycles of latency than the register-only version of the
instructions, but throughput remains the same.
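
The two rules above can be summarized in a short C sketch. It is only an
illustration of the arithmetic, assuming a first-level data cache hit as
stated in the text. The names exec_class, load_latency_penalty and
load_form_latency are hypothetical and do not come from this manual, and
the register-form latency passed in is whatever value the latency tables
give for the instruction in question.

```c
#include <assert.h>

/* Hypothetical classification used only for this illustration. */
enum exec_class {
    EXEC_INT_ALU,   /* instructions that execute in the integer ALUs */
    EXEC_FP_SIMD    /* floating-point, MMX, SSE and SSE2 instructions */
};

/* Extra latency, in clock cycles, of the load form over the
 * register-only form, per the rules in the text (L1 data cache hit
 * assumed): two cycles for integer ALU instructions, six otherwise. */
static int load_latency_penalty(enum exec_class cls)
{
    return (cls == EXEC_INT_ALU) ? 2 : 6;
}

/* Estimated latency of the load form given the register-form latency.
 * Throughput is unchanged by the load, so it is not modeled here. */
static int load_form_latency(enum exec_class cls, int reg_form_latency)
{
    return reg_form_latency + load_latency_penalty(cls);
}
```

For example, under these assumptions an integer ALU instruction with a
register-form latency of 1 clock would be estimated at 3 clocks in its
load form, and a floating-point or SIMD instruction with a register-form
latency of 4 clocks would be estimated at 10 clocks.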
When store operations are on the critical path, their results can generally
be forwarded to a dependent load in as few as zero cycles. Thus, the
latency for a store to complete is not relevant here.