Appendix C Instruction Latencies 317
Software Optimization Guide for AMD64 Processors
25112 Rev. 3.06 September 2005
C.7 SSE Instructions
Table 18. SSE Instructions
                              Encoding
Syntax                        Prefix  First  2nd   ModRM byte   Decode      FPU      Latency  Note
                              byte    byte   byte               type        pipe(s)
ADDPS xmmreg1, xmmreg2                0Fh    58h   11-xxx-xxx   Double      FADD     5        1
ADDPS xmmreg, mem128                  0Fh    58h   mm-xxx-xxx   Double      FADD     7        1
ADDSS xmmreg1, xmmreg2        F3h     0Fh    58h   11-xxx-xxx   DirectPath  FADD     4
ADDSS xmmreg, mem128          F3h     0Fh    58h   mm-xxx-xxx   DirectPath  FADD     6
ANDNPS xmmreg1, xmmreg2               0Fh    55h   11-xxx-xxx   Double      FMUL     3        1
ANDNPS xmmreg, mem128                 0Fh    55h   mm-xxx-xxx   Double      FMUL     5        1
ANDPS xmmreg1, xmmreg2                0Fh    54h   11-xxx-xxx   Double      FMUL     3        1
ANDPS xmmreg, mem128                  0Fh    54h   mm-xxx-xxx   Double      FMUL     5        1
CMPPS xmmreg1, xmmreg2, imm8          0Fh    C2h   11-xxx-xxx   Double      FADD     3        1
CMPPS xmmreg, mem128, imm8            0Fh    C2h   mm-xxx-xxx   Double      FADD     5        1
CMPSS xmmreg1, xmmreg2, imm8  F3h     0Fh    C2h   11-xxx-xxx   DirectPath  FADD     2
CMPSS xmmreg, mem32, imm8     F3h     0Fh    C2h   mm-xxx-xxx   DirectPath  FADD     4
COMISS xmmreg1, xmmreg2               0Fh    2Fh   11-xxx-xxx   VectorPath           4
Notes:
1. The low half of the result is available one cycle earlier than listed.
2. The second latency value indicates when the low half of the result becomes available.
3. The high half of the result is available one cycle earlier than listed.
4. The latency listed is the absolute minimum, while average latencies may be higher and are a function of internal
pipeline conditions.
5. For the PREFETCHNTA/T0/T1/T2 instructions, the mem8 value refers to an address in the 64-byte line to be
prefetched.
6. The 8-clock latency is only visible to younger stores that need to do an external write. The 2-clock latency is
visible to the other stores and instructions.
7. This is the execution latency for the instruction. The time to complete the external write depends on the memory
speed and the hardware implementation.
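
The ModRM column in Table 18 writes the byte as three bit fields, mod-reg-r/m, where 11 in the mod field selects the register-to-register form. As a rough illustration of how the encoding columns combine into machine code, the sketch below assembles the register-register forms of ADDPS and ADDSS. The helper names (modrm_byte, encode_sse_reg_reg) are hypothetical and not part of this guide; the sketch assumes legacy encodings without a REX prefix, so only XMM0 through XMM7 are addressable, with the destination register in the reg field and the source register in the r/m field.

```python
def modrm_byte(mod: int, reg: int, rm: int) -> int:
    """Pack mod (2 bits), reg (3 bits), and r/m (3 bits) into one ModRM byte."""
    if not (0 <= mod <= 3 and 0 <= reg <= 7 and 0 <= rm <= 7):
        raise ValueError("ModRM field out of range")
    return (mod << 6) | (reg << 3) | rm


def encode_sse_reg_reg(opcode_bytes, dst_xmm: int, src_xmm: int) -> bytes:
    """Encode 'op xmm<dst>, xmm<src>' for a two-operand SSE instruction.

    opcode_bytes holds the prefix byte (if any) followed by the opcode bytes,
    in the order listed in the table. Assumes no REX prefix (XMM0-XMM7 only).
    """
    # mod = 11b selects the register-direct form (the "11-xxx-xxx" rows).
    return bytes(opcode_bytes) + bytes([modrm_byte(0b11, dst_xmm, src_xmm)])


# Prefix and opcode bytes copied from the ADDPS and ADDSS rows of Table 18.
ADDPS = (0x0F, 0x58)
ADDSS = (0xF3, 0x0F, 0x58)

if __name__ == "__main__":
    print(encode_sse_reg_reg(ADDPS, 1, 2).hex(" "))  # addps xmm1, xmm2
    print(encode_sse_reg_reg(ADDSS, 0, 3).hex(" "))  # addss xmm0, xmm3
```

The memory forms (the mm-xxx-xxx rows) are deliberately left out, since they also require displacement and possibly SIB bytes that the table does not list.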