Chapter 9 Optimizing with SIMD Instructions 215

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

9.12 Use XOR Operations to Negate Operands of SSE, SSE2, and 3DNow!™ Instructions

Optimization

For AMD Athlon, AMD Athlon 64, and AMD Opteron processors, use instructions that perform XOR operations (PXOR, XORPS, and XORPD) instead of multiplication instructions to change the sign bit of operands of SSE, SSE2, and 3DNow! instructions.
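As a rough illustration of why an XOR can stand in for the multiplication, the following C sketch flips only the sign bit of a single-precision value. The helper name negate_by_xor is hypothetical and does not come from this guide, and the sketch assumes that float is a 32-bit IEEE 754 type.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the guide): negate a float by toggling
 * bit 31, the IEEE 754 sign bit. Assumes float is 32 bits wide. */
static float negate_by_xor(float x)
{
    uint32_t bits;

    memcpy(&bits, &x, sizeof bits);   /* Reinterpret the float as raw bits. */
    bits ^= UINT32_C(0x80000000);     /* Toggle only the sign bit. */
    memcpy(&x, &bits, sizeof x);      /* Reinterpret the bits as a float. */
    return x;
}
```

For ordinary values, zeros, and infinities the result matches multiplying by -1.0. The two approaches can differ for NaN inputs, because the XOR flips the sign bit unconditionally while a multiplication follows the NaN propagation rules.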

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

On the AMD Athlon 64 and AMD Opteron processors, using XOR-type instructions allows for more parallelism, as these instructions can execute in either the FADD or FMUL pipe of the floating-point unit.

Single Precision

For single-precision operands, you can use either 3DNow! or SSE SIMD XOR operations. The latency of multiplying by -1.0 in 3DNow! is 4 cycles, while the latency of the PXOR instruction is only 2 cycles. Similarly, the latency of the MULPS instruction is 5 cycles, while the latency of the XORPS instruction is 3 cycles. The following code example illustrates how to toggle the sign bits of the two single-precision values in an MMX register using 3DNow! instructions:

signmask DQ 8000000080000000h
pxor mm0, [signmask] ; Toggle sign bits of both floats.

This example does the same thing using SSE instructions:

signmask DQ 8000000080000000h,8000000080000000h ; Must be 16-byte aligned.
xorps xmm0, [signmask] ; Toggle sign bits of all four floats.
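For code written in C rather than assembly, the same idea can be sketched with SSE intrinsics. The helper name negate4 is hypothetical; the sketch assumes an x86 compiler that provides <xmmintrin.h>, and it builds the sign mask from -0.0f, whose bit pattern is 80000000h. Whether the compiler emits XORPS for _mm_xor_ps depends on the compiler and its options.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_xor_ps, ... */

/* Hypothetical helper (not from the guide): negate four packed floats
 * by XORing each lane with a mask that has only the sign bit set. */
static void negate4(float v[4])
{
    /* -0.0f is 80000000h, so every lane of the mask holds just the sign bit. */
    const __m128 signmask = _mm_set1_ps(-0.0f);

    __m128 x = _mm_loadu_ps(v);      /* Unaligned load of four floats. */
    x = _mm_xor_ps(x, signmask);     /* Toggle sign bits of all four floats. */
    _mm_storeu_ps(v, x);             /* Unaligned store of the result. */
}
```

The unaligned load and store are used here only so that the sketch accepts any float array; with 16-byte-aligned data, _mm_load_ps and _mm_store_ps could be used instead.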

Double Precision

For double-precision operands, you can use the XORPD instruction, as in the single-precision example, to flip the sign of packed double-precision floating-point operands. The XORPD instruction takes 3 cycles to execute, whereas the MULPD instruction requires 5 cycles.

signmask DQ 8000000000000000h,8000000000000000h ; Must be 16-byte aligned.
xorpd xmm0, [signmask] ; Toggle sign bit of both doubles.
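A corresponding C sketch for the double-precision case uses SSE2 intrinsics. The helper name negate2 is hypothetical; the sketch assumes an x86 compiler that provides <emmintrin.h>, and it relies on -0.0 having the bit pattern 8000000000000000h.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_xor_pd, ... */

/* Hypothetical helper (not from the guide): negate two packed doubles
 * by XORing each lane with a mask that has only the sign bit set. */
static void negate2(double v[2])
{
    /* -0.0 is 8000000000000000h, so each lane holds just the sign bit. */
    const __m128d signmask = _mm_set1_pd(-0.0);

    __m128d x = _mm_loadu_pd(v);     /* Unaligned load of two doubles. */
    x = _mm_xor_pd(x, signmask);     /* Toggle sign bit of both doubles. */
    _mm_storeu_pd(v, x);             /* Unaligned store of the result. */
}
```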
