Appendix C Instruction Latencies 323
Software Optimization Guide for AMD64 Processors
25112 Rev. 3.06 September 2005
Table 18. SSE Instructions (Continued)

                              Encoding
Syntax                        Prefix  First  2nd   ModRM byte   Decode type  FPU pipe(s)  Latency  Note
                              byte    byte   byte
----------------------------  ------  -----  ----  -----------  -----------  -----------  -------  ----
PREFETCHT0 mem8                       0Fh    18h   mm-001-xxx   DirectPath   ~            ~        5
PREFETCHT1 mem8                       0Fh    18h   mm-010-xxx   DirectPath   ~            ~        5
PREFETCHT2 mem8                       0Fh    18h   mm-011-xxx   DirectPath   ~            ~        5
PSADBW mmreg1, mmreg2                 0Fh    F6h   11-xxx-xxx   DirectPath   FADD         3
PSADBW mmreg, mem64                   0Fh    F6h   mm-xxx-xxx   DirectPath   FADD         5
PSHUFW mmreg1, mmreg2, imm8           0Fh    70h                DirectPath   FADD/FMUL    2
PSHUFW mmreg, mem64, imm8             0Fh    70h                DirectPath   FADD/FMUL    4
RCPPS xmmreg1, xmmreg2                0Fh    53h   11-xxx-xxx   Double       FMUL         4        1
RCPPS xmmreg, mem128                  0Fh    53h   mm-xxx-xxx   Double       FMUL         6        1
RCPSS xmmreg1, xmmreg2        F3h     0Fh    53h   11-xxx-xxx   DirectPath   FMUL         3
RCPSS xmmreg, mem32           F3h     0Fh    53h   mm-xxx-xxx   DirectPath   FMUL         5
RSQRTPS xmmreg1, xmmreg2              0Fh    52h   11-xxx-xxx   Double       FMUL         4        1
RSQRTPS xmmreg, mem128                0Fh    52h   mm-xxx-xxx   Double       FMUL         6        1
RSQRTSS xmmreg1, xmmreg2      F3h     0Fh    52h   11-xxx-xxx   DirectPath   FMUL         3
RSQRTSS xmmreg, mem32         F3h     0Fh    52h   mm-xxx-xxx   DirectPath   FMUL         5
SFENCE                                0Fh    AEh   11-111-000   VectorPath                2/8      6

Notes:
1. The low half of the result is available one cycle earlier than listed.
2. The second latency value indicates when the low half of the result becomes available.
3. The high half of the result is available one cycle earlier than listed.
4. The latency listed is the absolute minimum, while average latencies may be higher and are a function of internal
pipeline conditions.
5. For the PREFETCHNTA/T0/T1/T2 instructions, the mem8 value refers to an address in the 64-byte line to be
prefetched.
6. The 8-clock latency is only visible to younger stores that need to do an external write. The 2-clock latency is
visible to the other stores and instructions.
7. This is the execution latency for the instruction. The time to complete the external write depends on the memory
speed and the hardware implementation.
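
As an illustration of Note 5, the following C sketch shows why any byte address inside a line identifies the same 64-byte line for prefetching. The helper names (line_base, same_line, lines_spanned) are hypothetical and do not come from this guide; the sketch also assumes that lines are naturally aligned on 64-byte boundaries. It only performs address arithmetic and issues no prefetch instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Line size taken from Note 5; natural 64-byte alignment is an assumption. */
#define LINE_BYTES 64u

/* Hypothetical helper: base address of the 64-byte line that contains addr. */
static uintptr_t line_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(LINE_BYTES - 1u);
}

/* Hypothetical helper: nonzero when a and b lie in the same 64-byte line,
   so a prefetch of either address would name the same line. */
static int same_line(uintptr_t a, uintptr_t b)
{
    return line_base(a) == line_base(b);
}

/* Hypothetical helper: number of 64-byte lines touched by a buffer of
   len bytes starting at addr (0 for an empty buffer). */
static size_t lines_spanned(uintptr_t addr, size_t len)
{
    if (len == 0) {
        return 0;
    }
    return (size_t)((line_base(addr + len - 1) - line_base(addr)) / LINE_BYTES) + 1;
}
```

Under these assumptions, a loop that prefetches ahead of a sequential access pattern would need only one prefetch per 64 bytes of data, and lines_spanned gives a rough count of how many such prefetches a buffer would require.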