Chapter 9 Optimizing with SIMD Instructions 215
Software Optimization Guide for AMD64 Processors
25112 Rev. 3.06 September 2005
9.12 Use XOR Operations to Negate Operands of SSE, SSE2, and 3DNow!™ Instructions
Optimization
For AMD Athlon, AMD Athlon 64, and AMD Opteron processors, use instructions that perform
XOR operations (PXOR, XORPS, and XORPD) instead of multiplication instructions to change the
sign bit of operands of SSE, SSE2, and 3DNow! instructions.
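The sign-bit toggle described above can be illustrated in portable C. This is a minimal sketch, not part of the original guide: it assumes that float is an IEEE 754 32-bit value, and the helper name negate_by_xor is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: negate a float by toggling bit 31,
 * which is the sign bit under the assumed IEEE 754 32-bit layout. */
static float negate_by_xor(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* copy out the raw bit pattern */
    bits ^= UINT32_C(0x80000000);     /* flip only the sign bit */
    memcpy(&x, &bits, sizeof x);      /* copy the modified pattern back */
    return x;
}
```

The exponent and mantissa bits are untouched, so the magnitude is preserved exactly and no multiplication is needed.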
Application
This optimization applies to:
32-bit software
64-bit software
Rationale
On the AMD Athlon 64 and AMD Opteron processors, using XOR-type instructions allows for more
parallelism, as these instructions can execute in either the FADD or FMUL pipe of the floating-point
unit.
Single Precision
For single-precision operands, you can use either 3DNow! or SSE SIMD XOR operations. In 3DNow!,
multiplying by -1.0 has a latency of 4 cycles, while the PXOR instruction has a latency of only
2 cycles. Similarly, the MULPS instruction has a latency of 5 cycles, while the XORPS instruction
has a latency of 3 cycles. The following code example toggles the sign bits of two packed
single-precision values using 3DNow! instructions:
signmask DQ 8000000080000000h
pxor mm0, [signmask] ; Toggle sign bits of both floats.
This example does the same thing using SSE instructions:
signmask DQ 8000000080000000h,8000000080000000h ; Must be 16-byte aligned.
xorps xmm0, [signmask] ; Toggle sign bits of all four floats.
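For C code, the same operation can be sketched with SSE intrinsics. This example is an illustration rather than part of the original guide: the helper name negate4_ps is hypothetical, and the compiler is expected, but not guaranteed, to emit an XORPS instruction for _mm_xor_ps.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: negate four packed floats in place. */
static void negate4_ps(float v[4])
{
    /* -0.0f has only the sign bit set (80000000h), so broadcasting it
     * builds the same mask as signmask in the assembly example. */
    const __m128 signmask = _mm_set1_ps(-0.0f);

    __m128 x = _mm_loadu_ps(v);      /* unaligned load: no alignment assumed */
    x = _mm_xor_ps(x, signmask);     /* toggle the sign bit of all four floats */
    _mm_storeu_ps(v, x);
}
```

Unaligned loads and stores are used here so that the sketch works for any caller-supplied array; data known to be 16-byte aligned can use _mm_load_ps and _mm_store_ps instead.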
Double Precision
For double-precision operands, use the XORPD instruction, as in the single-precision example,
to flip the sign of packed double-precision floating-point operands. The XORPD instruction
has a latency of 3 cycles, whereas the MULPD instruction has a latency of 5 cycles.
signmask DQ 8000000000000000h,8000000000000000h ; Must be 16-byte aligned.
xorpd xmm0, [signmask] ; Toggle sign bit of both doubles.
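The double-precision case can be sketched the same way with SSE2 intrinsics. As before, this is an assumption-laden illustration and not the guide's own code: negate2_pd is a hypothetical name, and the mapping of _mm_xor_pd to XORPD depends on the compiler.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: negate two packed doubles in place. */
static void negate2_pd(double v[2])
{
    /* -0.0 has only bit 63 set (8000000000000000h), matching the
     * per-element mask in the assembly example. */
    const __m128d signmask = _mm_set1_pd(-0.0);

    __m128d x = _mm_loadu_pd(v);     /* unaligned load: no alignment assumed */
    x = _mm_xor_pd(x, signmask);     /* toggle the sign bit of both doubles */
    _mm_storeu_pd(v, x);
}
```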