Appendix C Instruction Latencies
Software Optimization Guide for AMD64 Processors
25112 Rev. 3.06 September 2005
Table 18. SSE Instructions (Continued)

Syntax                    Encoding                         Decode      FPU pipe(s)       Latency  Note
                          Prefix  First  2nd   ModRM       type
                          byte    byte   byte  byte
DIVPS xmmreg, mem128              0Fh    5Eh   mm-xxx-xxx  Double      FMUL              35
DIVSS xmmreg1, xmmreg2    F3h     0Fh    5Eh   11-xxx-xxx  DirectPath  FMUL              16
DIVSS xmmreg, mem32       F3h     0Fh    5Eh   mm-xxx-xxx  DirectPath  FMUL              18
LDMXCSR mem32                     0Fh    AEh   mm-010-xxx  VectorPath                    13       4
MASKMOVQ mmreg1, mmreg2           0Fh    F7h   11-xxx-xxx  VectorPath  FADD/FMUL/FSTORE  29
MAXPS xmmreg1, xmmreg2            0Fh    5Fh   11-xxx-xxx  Double      FADD              3        1
MAXPS xmmreg, mem128              0Fh    5Fh   mm-xxx-xxx  Double      FADD              5        1
MAXSS xmmreg1, xmmreg2    F3h     0Fh    5Fh   11-xxx-xxx  DirectPath  FADD              2
MAXSS xmmreg, mem32       F3h     0Fh    5Fh   mm-xxx-xxx  DirectPath  FADD              4
MINPS xmmreg1, xmmreg2            0Fh    5Dh   11-xxx-xxx  Double      FADD              3        1
MINPS xmmreg, mem128              0Fh    5Dh   mm-xxx-xxx  Double      FADD              5        1
MINSS xmmreg1, xmmreg2    F3h     0Fh    5Dh   11-xxx-xxx  DirectPath  FADD              2
MINSS xmmreg, mem32       F3h     0Fh    5Dh   mm-xxx-xxx  DirectPath  FADD              4
MOVAPS xmmreg1, xmmreg2           0Fh    28h   11-xxx-xxx  Double                        2
MOVAPS xmmreg, mem128             0Fh    28h   mm-xxx-xxx  Double                        2
Notes:
1. The low half of the result is available one cycle earlier than listed.
2. The second latency value indicates when the low half of the result becomes available.
3. The high half of the result is available one cycle earlier than listed.
4. The latency listed is the absolute minimum, while average latencies may be higher and are a function of internal
pipeline conditions.
5. For the PREFETCHNTA/T0/T1/T2 instructions, the mem8 value refers to an address in the 64-byte line to be
prefetched.
6. The 8-clock latency is only visible to younger stores that need to do an external write. The 2-clock latency is
visible to the other stores and instructions.
7. This is the execution latency for the instruction. The time to complete the external write depends on the memory
speed and the hardware implementation.
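
The encoding columns of Table 18 list the bytes of each instruction in the order they appear in the instruction stream: an optional prefix byte, the first opcode byte (0Fh), the second opcode byte, and the ModRM byte. As a rough illustration of how the register-to-register rows (ModRM = 11-xxx-xxx) turn into machine code, the following C sketch assembles those bytes. The helper names modrm_reg_reg and encode_sse_rr are hypothetical and are not part of this guide. The sketch assumes the standard ModRM layout, with the first (destination) operand in the reg field and the second (source) operand in the r/m field, and it covers only registers 0 through 7 (no REX prefix).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build a register-to-register ModRM byte (hypothetical helper).
 * mod = 11b selects the register form, reg holds the first operand,
 * r/m holds the second operand. Register numbers must be 0..7. */
static uint8_t modrm_reg_reg(unsigned first_op, unsigned second_op)
{
    assert(first_op < 8 && second_op < 8);
    return (uint8_t)(0xC0u | (first_op << 3) | second_op);
}

/* Emit the bytes of a two-byte-opcode SSE register-register instruction
 * (hypothetical helper). Pass prefix = 0 for rows whose prefix column is
 * empty. The buffer must hold at least 4 bytes. Returns the byte count. */
static size_t encode_sse_rr(uint8_t *out, uint8_t prefix, uint8_t second_byte,
                            unsigned first_op, unsigned second_op)
{
    size_t n = 0;

    if (prefix != 0)
        out[n++] = prefix;          /* Prefix byte column, e.g. F3h */
    out[n++] = 0x0F;                /* First byte column */
    out[n++] = second_byte;         /* 2nd byte column, e.g. 5Eh for DIVSS */
    out[n++] = modrm_reg_reg(first_op, second_op); /* ModRM byte column */
    return n;
}
```

Under these assumptions, encode_sse_rr(buf, 0xF3, 0x5E, 1, 2) writes F3h 0Fh 5Eh CAh, the register form of DIVSS with xmm1 as the first operand and xmm2 as the second, and encode_sse_rr(buf, 0, 0x5F, 0, 3) writes 0Fh 5Fh C3h for MAXPS with xmm0 and xmm3. Memory forms (mm-xxx-xxx) need additional addressing bytes and are outside the scope of this sketch.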