320 Instruction Latencies Appendix C
25112 Rev. 3.06 September 2005
Software Optimization Guide for AMD64 Processors
Table 18. SSE Instructions (Continued)

                          Encoding
Syntax                    Prefix  First  2nd   ModRM       Decode type  FPU pipe(s)  Latency  Note
                          byte    byte   byte  byte
------------------------  ------  -----  ----  ----------  -----------  -----------  -------  ----
MOVAPS xmmreg1, xmmreg2           0Fh    29h   11-xxx-xxx  Double                    2
MOVAPS mem128, xmmreg             0Fh    29h   mm-xxx-xxx  Double                    3        1
MOVHLPS xmmreg1, xmmreg2          0Fh    12h   11-xxx-xxx  DirectPath                2
MOVHPS xmmreg, mem64              0Fh    16h   mm-xxx-xxx  DirectPath                2
MOVHPS mem64, xmmreg              0Fh    17h   mm-xxx-xxx  DirectPath                2
MOVLHPS xmmreg1, xmmreg2          0Fh    16h   11-xxx-xxx  DirectPath                2
MOVLPS xmmreg, mem64              0Fh    12h   mm-xxx-xxx  DirectPath                2
MOVLPS mem64, xmmreg              0Fh    13h   mm-xxx-xxx  DirectPath                2
MOVMSKPS reg32, xmmreg            0Fh    50h   11-xxx-xxx  VectorPath                3
MOVNTPS mem128, xmmreg            0Fh    2Bh   mm-xxx-xxx  Double                    3        7
MOVNTQ mem64, mmreg               0Fh    E7h   mm-xxx-xxx  DirectPath   FSTORE       2        7
MOVSS xmmreg1, xmmreg2    F3h     0Fh    10h   11-xxx-xxx  DirectPath                2
MOVSS xmmreg, mem32       F3h     0Fh    10h   mm-xxx-xxx  Double                    3
MOVSS xmmreg1, xmmreg2    F3h     0Fh    11h   11-xxx-xxx  DirectPath                2

Notes:
1. The low half of the result is available one cycle earlier than listed.
2. The second latency value indicates when the low half of the result becomes available.
3. The high half of the result is available one cycle earlier than listed.
4. The latency listed is the absolute minimum, while average latencies may be higher and are a function of internal
pipeline conditions.
5. For the PREFETCHNTA/T0/T1/T2 instructions, the mem8 value refers to an address in the 64-byte line to be
prefetched.
6. The 8-clock latency is only visible to younger stores that need to do an external write. The 2-clock latency is
visible to the other stores and instructions.
7. This is the execution latency for the instruction. The time to complete the external write depends on the memory
speed and the hardware implementation.
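
Example (illustrative, not part of the original table): the ModRM column uses the notation 11-xxx-xxx for register-to-register forms and mm-xxx-xxx for memory forms. The three groups correspond to the standard x86 ModRM fields: mod (bits 7..6), reg (bits 5..3), and r/m (bits 2..0), where mod = 11b selects a register operand. The C sketch below splits and packs those fields. The names modrm_fields, modrm_split, modrm_pack, and modrm_is_register_form are hypothetical helpers introduced here for illustration, and the sketch assumes registers xmm0 through xmm7 with no REX prefix.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a ModRM byte, matching the table's "mm-xxx-xxx" notation.
   All names in this sketch are hypothetical, not taken from the guide. */
typedef struct {
    uint8_t mod; /* bits 7..6: 3 (11b) means the r/m operand is a register */
    uint8_t reg; /* bits 5..3: register operand or opcode extension */
    uint8_t rm;  /* bits 2..0: register or memory operand selector */
} modrm_fields;

/* Split a ModRM byte into its three fields. */
static modrm_fields modrm_split(uint8_t modrm)
{
    modrm_fields f;
    f.mod = (uint8_t)(modrm >> 6);
    f.reg = (uint8_t)((modrm >> 3) & 7u);
    f.rm  = (uint8_t)(modrm & 7u);
    return f;
}

/* Build a ModRM byte from its fields. */
static uint8_t modrm_pack(uint8_t mod, uint8_t reg, uint8_t rm)
{
    assert(mod < 4 && reg < 8 && rm < 8);
    return (uint8_t)((mod << 6) | (reg << 3) | rm);
}

/* Nonzero for the "11-xxx-xxx" (register) rows, zero for the
   "mm-xxx-xxx" (memory) rows. */
static int modrm_is_register_form(uint8_t modrm)
{
    return (modrm >> 6) == 3;
}
```

For instance, modrm_pack(3, 2, 1) yields D1h. With the 0Fh 29h (store-direction) MOVAPS opcode, where the reg field names the source and the r/m field the destination, that corresponds to the byte sequence 0Fh 29h D1h for a copy from xmm2 to xmm1.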