AMD 250 Manual

next previous

Contents vii

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

9.11 Using SIMD Instructions for Fast Square Roots and Fast

Reciprocal Square Roots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .211

9.12 Use XOR Operations to Negate Operands of SSE, SSE2, and

3DNow!™ Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .215

9.13 Clearing MMX™ and XMM Registers with XOR Instructions . . . . . . . . . . . . . . . .216

9.14 Finding the Floating-Point Absolute Value of Operands of SSE, SSE2, and

3DNow!™ Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .217

9.15 Accumulating Single-Precision Floating-Point Numbers Using SSE, SSE2,

and 3DNow!™ Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .218

9.16 Complex-Number Arithmetic Using SSE, SSE2, and 3DNow!™ Instructions . . . .221

9.17 Optimized 4 × 4 Matrix Multiplication on 4 × 1 Column Vector Routines . . . . . . .230

Chapter 10 x87 Floating-Point Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .237

10.1 Using Multiplication Rather Than Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .238

10.2 Achieving Two Floating-Point Operations per Clock Cycle . . . . . . . . . . . . . . . . . .239

10.3 Floating-Point Compare Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .244

10.4 Using the FXCH Instruction Rather Than FST/FLD Pairs . . . . . . . . . . . . . . . . . . . .245

10.5 Floating-Point Subexpression Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .246

10.6 Accumulating Precision-Sensitive Quantities in x87 Registers . . . . . . . . . . . . . . . .247

10.7 Avoiding Extended-Precision Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .248

Appendix A Microarchitecture for AMD Athlon™ 64 and AMD Opteron™ Processors . .249

A.1 Key Microarchitecture Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .250

A.2 Microarchitecture for AMD Athlon™ 64 and AMD Opteron™ Processors . . . . . .251

A.3 Superscalar Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .251

A.4 Processor Block Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .251

A.5 L1 Instruction Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .252

A.6 Branch-Prediction Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .253

A.7 Fetch-Decode Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .254

A.8 Instruction Control Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .254

A.9 Translation-Lookaside Buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .254

A.10 L1 Data Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .255

A.11 Integer Scheduler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .256