Appendix C Instruction Latencies 303
Software Optimization Guide for AMD64 Processors
25112 Rev. 3.06 September 2005
C.3 MMX™ Technology Instructions
Table 14. MMX™ Technology Instructions
                         Encoding
Syntax                   Prefix byte  First byte  ModRM byte  Decode type  FPU pipe(s)       Latency  Note
EMMS                     0Fh          77h                     DirectPath   FADD/FMUL/FSTORE  6        2
MOVD mmreg, reg32        0Fh          6Eh         11-xxx-xxx  Double       -                 9        1
MOVD mmreg, reg64        0Fh          6Eh         11-xxx-xxx  Double       -                 9        1
MOVD mmreg, mem32        0Fh          6Eh         mm-xxx-xxx  DirectPath   FADD/FMUL/FSTORE  4        2
MOVD mmreg, mem64        0Fh          6Eh         mm-xxx-xxx  DirectPath   FADD/FMUL/FSTORE  4        2
MOVD reg32, mmreg        0Fh          7Eh         11-xxx-xxx  Double       -                 4        1
MOVD reg64, mmreg        0Fh          7Eh         11-xxx-xxx  Double       -                 4        1
MOVD mem32, mmreg        0Fh          7Eh         mm-xxx-xxx  DirectPath   FSTORE            2
MOVD mem64, mmreg        0Fh          7Eh         mm-xxx-xxx  DirectPath   FSTORE            2
MOVQ mmreg1, mmreg2      0Fh          6Fh         11-xxx-xxx  DirectPath   FADD/FMUL         2
MOVQ mmreg, mem64        0Fh          6Fh         mm-xxx-xxx  DirectPath   FADD/FMUL/FSTORE  4        2
MOVQ mmreg2, mmreg1      0Fh          7Fh         11-xxx-xxx  DirectPath   FADD/FMUL         2
MOVQ mem64, mmreg        0Fh          7Fh         mm-xxx-xxx  DirectPath   FSTORE            2
PACKSSDW mmreg1, mmreg2  0Fh          6Bh         11-xxx-xxx  DirectPath   FADD/FMUL         2
PACKSSDW mmreg, mem64    0Fh          6Bh         mm-xxx-xxx  DirectPath   FADD/FMUL         4
PACKSSWB mmreg1, mmreg2  0Fh          63h         11-xxx-xxx  DirectPath   FADD/FMUL         2
PACKSSWB mmreg, mem64    0Fh          63h         mm-xxx-xxx  DirectPath   FADD/FMUL         4
PACKUSWB mmreg1, mmreg2  0Fh          67h         11-xxx-xxx  DirectPath   FADD/FMUL         2
PACKUSWB mmreg, mem64    0Fh          67h         mm-xxx-xxx  DirectPath   FADD/FMUL         4
PADDB mmreg1, mmreg2     0Fh          FCh         11-xxx-xxx  DirectPath   FADD/FMUL         2
PADDB mmreg, mem64       0Fh          FCh         mm-xxx-xxx  DirectPath   FADD/FMUL         4
PADDD mmreg1, mmreg2     0Fh          FEh         11-xxx-xxx  DirectPath   FADD/FMUL         2
PADDD mmreg, mem64       0Fh          FEh         mm-xxx-xxx  DirectPath   FADD/FMUL         4
PADDSB mmreg1, mmreg2    0Fh          ECh         11-xxx-xxx  DirectPath   FADD/FMUL         2
PADDSB mmreg, mem64      0Fh          ECh         mm-xxx-xxx  DirectPath   FADD/FMUL         4
Notes:
1. Bits 2, 1, and 0 of the ModRM byte select the integer register.
2. The latency shown is the effective latency of these instructions. However, each of these instructions also
generates an internal NOP with a latency of two cycles and no related dependencies. These internal NOPs can be
executed at a rate of three per cycle and can use any of the three execution resources.
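
The ModRM patterns in the table (11-xxx-xxx for register forms, mm-xxx-xxx for memory forms) and Note 1 can be illustrated with a small decoding sketch. This is only an illustration in Python, not part of the guide: the helper names split_modrm, is_register_form, and describe_movd_to_mmx are hypothetical, and the sketch assumes no REX prefix, so only the eight legacy 32-bit integer registers are considered.

```python
# Illustrative sketch: splitting a ModRM byte into its three fields.
# All names here are hypothetical helpers, not part of any AMD tool.

# Legacy 32-bit integer registers, indexed by their 3-bit encoding.
# Assumption: no REX prefix, so the extended registers are not reachable.
GPR32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def split_modrm(modrm):
    """Return (mod, reg, rm) for a one-byte ModRM value."""
    if not 0 <= modrm <= 0xFF:
        raise ValueError("ModRM must fit in one byte")
    mod = (modrm >> 6) & 0b11   # bits 7-6: 11 means register-direct
    reg = (modrm >> 3) & 0b111  # bits 5-3: the MMX register in these rows
    rm = modrm & 0b111          # bits 2-0: see Note 1
    return mod, reg, rm


def is_register_form(modrm):
    """True for the 11-xxx-xxx rows, False for the mm-xxx-xxx rows."""
    return split_modrm(modrm)[0] == 0b11


def describe_movd_to_mmx(modrm):
    """Describe MOVD mmreg, reg32 (0Fh 6Eh) for a register-form ModRM."""
    mod, reg, rm = split_modrm(modrm)
    if mod != 0b11:
        raise ValueError("not a register-form ModRM byte")
    # Per Note 1, bits 2-0 select the integer register.
    return "movd mm%d, %s" % (reg, GPR32[rm])


if __name__ == "__main__":
    # Bytes 0Fh 6Eh C8h: ModRM C8h is 11-001-000 in binary.
    print(describe_movd_to_mmx(0xC8))
```

For the byte sequence 0Fh 6Eh C8h, the ModRM byte C8h splits into mod = 11, reg = 001, and r/m = 000, which corresponds to the MOVD mmreg, reg32 row with mm1 as the destination and EAX as the source.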