AMD 250 Computer Hardware User Manual


230 Optimizing with SIMD Instructions Chapter 9

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

9.17 Optimized 4 × 4 Matrix Multiplication on 4 × 1 Column Vector Routines

Optimization

Transpose the rotation matrix to eliminate the need to accumulate floating-point values in an XMM register.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

The multiplication of a 4 × 4 matrix with a 4 × 1 vector is commonly used in 3-D graphics for geometric transformation (translating, scaling, rotating, and applying perspective to 3-D points represented in homogeneous coordinates). Efficiency in single-precision matrix multiplication can be enhanced by use of SIMD instructions to increase throughput, but there are other general optimizations that can be implemented to further increase performance. The first optimization is the transposition of the rotation matrix such that the column n of the matrix becomes the row n and the row m becomes the column m. This optimization does not benefit 3DNow! technology code (3DNow! technology has extended instructions that preclude the need for this optimization), but does benefit SSE code. There are no SSE or SSE2 instructions that accumulate the floats and doubles in a single XMM register; for this reason, the matrix must be transposed. If the rotation matrix is not transposed, then the dot-product of a row of the matrix with a column vector necessitates the accumulation of the four floating-point values in an XMM register. The multiplication upon the column vector is illustrated here:

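$$
\mathrm{tr}(R) \times v
= \mathrm{tr}\begin{bmatrix}
r_{00} & r_{01} & r_{02} & r_{03} \\
r_{10} & r_{11} & r_{12} & r_{13} \\
r_{20} & r_{21} & r_{22} & r_{23} \\
r_{30} & r_{31} & r_{32} & r_{33}
\end{bmatrix} \times v
= \begin{bmatrix}
r_{00} & r_{10} & r_{20} & r_{30} \\
r_{01} & r_{11} & r_{21} & r_{31} \\
r_{02} & r_{12} & r_{22} & r_{32} \\
r_{03} & r_{13} & r_{23} & r_{33}
\end{bmatrix} \times
\begin{bmatrix} v_0 \\ v_1 \\ v_2 \\ v_3 \end{bmatrix}
= \begin{bmatrix} v'_0 \\ v'_1 \\ v'_2 \\ v'_3 \end{bmatrix}
$$

$$
\begin{bmatrix} v'_0 \\ v'_1 \\ v'_2 \\ v'_3 \end{bmatrix}
= \underbrace{\begin{bmatrix} r_{00} \times v_0 \\ r_{10} \times v_0 \\ r_{20} \times v_0 \\ r_{30} \times v_0 \end{bmatrix}}_{\text{Step 0}}
+ \underbrace{\begin{bmatrix} r_{01} \times v_1 \\ r_{11} \times v_1 \\ r_{21} \times v_1 \\ r_{31} \times v_1 \end{bmatrix}}_{\text{Step 1}}
+ \underbrace{\begin{bmatrix} r_{02} \times v_2 \\ r_{12} \times v_2 \\ r_{22} \times v_2 \\ r_{32} \times v_2 \end{bmatrix}}_{\text{Step 2}}
+ \underbrace{\begin{bmatrix} r_{03} \times v_3 \\ r_{13} \times v_3 \\ r_{23} \times v_3 \\ r_{33} \times v_3 \end{bmatrix}}_{\text{Step 3}}
$$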
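
The four steps shown above could be sketched with SSE intrinsics in C roughly as follows. This is an illustration under stated assumptions, not the guide's own routine: the function name `mat4_mul_vec4_transposed` and its signature are hypothetical, the matrix is assumed to be stored already transposed so that the four coefficients scaled in step n are contiguous in memory, and `trR` and `out` are assumed to be 16-byte aligned so that the aligned load and store intrinsics (`_mm_load_ps`, `_mm_store_ps`) are legal. Each step is one element-wise multiply and one element-wise add, so no values have to be accumulated across the lanes of a single XMM register.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h> /* SSE intrinsics; assumes an x86 target with SSE */

/*
 * Hypothetical helper (name and signature are illustrative only).
 *
 * trR : 16 floats holding the matrix in transposed storage, so that
 *       trR[4*n + i] is the coefficient that multiplies v[n] in output
 *       element i. Must be 16-byte aligned.
 * v   : the 4 x 1 input vector (read element by element, so no
 *       alignment requirement in this sketch).
 * out : receives the 4 x 1 result. Must be 16-byte aligned.
 */
static inline void mat4_mul_vec4_transposed(const float *trR,
                                            const float *v,
                                            float *out)
{
    /* Aligned loads and stores fault on misaligned addresses. */
    assert(((uintptr_t)trR & 15u) == 0);
    assert(((uintptr_t)out & 15u) == 0);

    /* Step 0: load the first four stored coefficients and scale by v0,
     * which _mm_set1_ps broadcasts to all four lanes. */
    __m128 acc = _mm_mul_ps(_mm_load_ps(trR + 0), _mm_set1_ps(v[0]));

    /* Steps 1 to 3: scale the next four coefficients by v1, v2, v3 and
     * add the products lane by lane to the running result. */
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(trR + 4),  _mm_set1_ps(v[1])));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(trR + 8),  _mm_set1_ps(v[2])));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(trR + 12), _mm_set1_ps(v[3])));

    /* Lane i now holds v'_i; write all four with one aligned store. */
    _mm_store_ps(out, acc);
}
```

A caller would declare the matrix and the result buffer with 16-byte alignment (for example with `_Alignas(16)` in C11) and pass pointers to their first elements. Broadcasting with `_mm_set1_ps` is only one way to replicate a vector element across a register; a shuffle of an already loaded vector would serve the same purpose.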

In each step above, the elements of the rotation matrix can be loaded into an XMM register with the MOVAPS instruction, assuming the rotation matrix begins at a 16-byte-aligned memory location. Transposition of the rotation matrix eliminates the need to accumulate the floating-point values in an
