336 Instruction Latencies Appendix C
25112 Rev. 3.06 September 2005
Software Optimization Guide for AMD64 Processors
Table 19. SSE2 Instructions (Continued)

| Syntax | Prefix byte | First byte | 2nd byte | ModRM byte | Decode type | FPU pipe(s) | Latency | Throughput | Note |
|---|---|---|---|---|---|---|---|---|---|
| PMULHUW xmmreg1, xmmreg2 | 66h | 0Fh | E4h | | Double | FMUL | 4 | 1/2 | |
| PMULHUW xmmreg, mem128 | 66h | 0Fh | E4h | | Double | FMUL | 6 | 1/2 | |
| PMULHW xmmreg1, xmmreg2 | 66h | 0Fh | E5h | | Double | FMUL | 4 | 1/2 | |
| PMULHW xmmreg, mem128 | 66h | 0Fh | E5h | | Double | FMUL | 6 | 1/2 | |
| PMULLW xmmreg1, xmmreg2 | 66h | 0Fh | D5h | | Double | FMUL | 4 | 1/2 | |
| PMULLW xmmreg, mem128 | 66h | 0Fh | D5h | | Double | FMUL | 6 | 1/2 | |
| PMULUDQ mmreg1, mmreg2 | | 0Fh | F4h | | DirectPath | FMUL | 3 | 1/2 | |
| PMULUDQ mmreg, mem64 | | 0Fh | F4h | | DirectPath | FMUL | 5 | 1/2 | |
| PMULUDQ xmmreg1, xmmreg2 | 66h | 0Fh | F4h | | Double | FMUL | 4 | 1/2 | |
| PMULUDQ xmmreg, mem128 | 66h | 0Fh | F4h | | Double | FMUL | 6 | 1/2 | |
| POR xmmreg1, xmmreg2 | 66h | 0Fh | EBh | | Double | FADD/FMUL | 2 | 1/1 | |
| POR xmmreg, mem128 | 66h | 0Fh | EBh | | Double | FADD/FMUL | 4 | 1/1 | |
| PSADBW xmmreg1, xmmreg2 | 66h | 0Fh | F6h | | Double | FADD | 4 | 1/2 | |
| PSADBW xmmreg, mem128 | 66h | 0Fh | F6h | | Double | FADD | 6 | 1/2 | |
| PSHUFD xmmreg1, xmmreg2, imm8 | 66h | 0Fh | 70h | | VectorPath | ~ | 4 | | |
| PSHUFD xmmreg, mem128, imm8 | 66h | 0Fh | 70h | | VectorPath | ~ | 6 | | |
| PSHUFHW xmmreg1, xmmreg2, imm8 | F3h | 0Fh | 70h | | Double | FADD/FMUL | 2 | 1/1 | |
Notes:
1. The low half of the result is available one cycle earlier than listed.
2. This is the execution latency for the instruction. The time to complete the external write depends on the memory
speed and the hardware implementation.