322 Instruction Latencies Appendix C
25112 Rev. 3.06 September 2005
Software Optimization Guide for AMD64 Processors
PEXTRW reg32/64,
mmreg, imm8
0Fh C5h Double - 4 4
PINSRW mmreg,
reg32/64, imm8
0Fh C4h Double - 9 4
PINSRW mmreg,
mem16, imm8
0Fh C4h DirectPath - 4 4
PMAXSW mmreg1,
mmreg2
0Fh EEh 11-xxx-xxx DirectPath FADD/FMUL 2
PMAXSW mmreg,
mem64
0Fh EEh mm-xxx-xxx DirectPath FADD/FMUL 4
PMAXUB mmreg1,
mmreg2
0Fh DEh 11-xxx-xxx DirectPath FADD/FMUL 2
PMAXUB mmreg,
mem64
0Fh DEh mm-xxx-xxx DirectPath FADD/FMUL 4
PMINSW mmreg1,
mmreg2
0Fh EAh 11-xxx-xxx DirectPath FADD/FMUL 2
PMINSW mmreg,
mem64
0Fh EAh mm-xxx-xxx DirectPath FADD/FMUL 4
PMINUB mmreg1,
mmreg2
0Fh DAh 11-xxx-xxx DirectPath FADD/FMUL 2
PMINUB mmreg,
mem64
0Fh DAh mm-xxx-xxx DirectPath FADD/FMUL 4
PMOVMSKB reg32/64,
mmreg
0Fh D7h VectorPath - 3 4
PMULHUW mmreg1,
mmreg2
0Fh E4h 11-xxx-xxx DirectPath FMUL 3
PMULHUW mmreg,
mem64
0Fh E4h mm-xxx-xxx DirectPath FMUL 5
PREFETCHNTA mem8 0Fh 18h mm-000-xxx DirectPath ~ ~ 5
Table 18. SSE Instructions (Continued)
Syntax
Encoding
Decode
type
FPU pipe(s) Latency Note
Prefix
byte
First
byte
2nd
byte
ModRM byte
Notes:
1. The low half of the result is available one cycle earlier than listed.
2. The second latency value indicates when the low half of the result becomes available.
3. The high half of the result is available one cycle earlier than listed.
4. The latency listed is the absolute minimum, while average latencies may be higher and are a function of internal
pipeline conditions.
5. For the PREFETCHNTA/T0/T1/T2 instructions, the mem8 value refers to an address in the 64-byte line to be
prefetched.
6. The 8-clock latency is only visible to younger stores that need to do an external write. The 2-clock latency is
visible to the other stores and instructions.
7. This is the execution latency for the instruction. The time to complete the external write depends on the memory
speed and the hardware implementation.