To form each row of CT, the transpose of the matrix product C = A × B, the rows of matrix A are “dotted” with a row of BT, the transpose of B (that is, with a column of B). Once this is done, the rows of matrix A are “dotted” with the next row of BT, and the process is repeated through all the rows of BT.
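
This access pattern can be sketched in plain C before any prefetching is added. The sketch below is illustrative only: it assumes row-major 32 × 32 double-precision arrays a, bt, and ct holding A, BT, and CT, and the function and variable names are stand-ins rather than names from the example discussed in this section; only the counter Ctr_row_num is taken from the surrounding text.

#define ORDER 32    /* matrix order used throughout this example */

void matmul_bt(const double a[ORDER][ORDER],   /* matrix A                 */
               const double bt[ORDER][ORDER],  /* BT, the transpose of B   */
               double ct[ORDER][ORDER])        /* CT, the transpose of C   */
{
    int Ctr_row_num, a_row, k;

    /* Each row of CT is one column of B (one row of BT) "dotted" with
       every row of A.                                                  */
    for (Ctr_row_num = 0; Ctr_row_num < ORDER; Ctr_row_num++) {
        for (a_row = 0; a_row < ORDER; a_row++) {
            double sum = 0.0;
            for (k = 0; k < ORDER; k++)          /* 32 multiplies + 32 adds */
                sum += a[a_row][k] * bt[Ctr_row_num][k];
            ct[Ctr_row_num][a_row] = sum;        /* one element of CT       */
        }
    }
}

Because B is stored transposed, both operands of the inner loop are read sequentially with unit stride.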
From a performance standpoint, there are several caveats to recognize, as follows:
• Once all the rows of A have been multiplied with the first column of B, all the rows of A are in the cache, and subsequent accesses to them do not cause cache misses.
• The rows of BT are brought into the cache by “dotting” the first four rows of A with each row of BT in the Ctr_row_num for-loop.
• The elements of CT are not initially in the cache, and every time a new set of four rows of A is “dotted” with a new row of BT, the processor has to wait for CT to arrive in the cache before the results can be written.
You can address the last two caveats by prefetching to improve performance. However, to exploit prefetching efficiently, you must structure the code to issue the prefetch instructions such that:
• Enough time is provided for the memory requests sent out by the prefetch instructions to bring the data into the processor’s cache before the data is needed.
• The loops containing the prefetch instructions are arranged to issue enough prefetch instructions to fetch all the pertinent data (a generic sketch of both rules follows).
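
The two rules can be seen in isolation in the following generic sketch, which is not the matrix code itself. It streams through a large array and uses the GCC/Clang builtin __builtin_prefetch as a portable stand-in for the PREFETCH instruction; the look-ahead distance of eight cache lines is an illustrative assumption, not a figure from this guide.

#define DOUBLES_PER_LINE 8            /* 64-byte cache line / 8-byte double  */
#define LINES_AHEAD      8            /* how far ahead to prefetch (assumed) */

double sum_array(const double *x, long n)
{
    double sum = 0.0;
    for (long i = 0; i < n; i++) {
        /* One prefetch per cache line, so every pertinent line is covered,
           issued LINES_AHEAD lines early, so it has time to arrive.        */
        if (i % DOUBLES_PER_LINE == 0 &&
            i + LINES_AHEAD * DOUBLES_PER_LINE < n)
            __builtin_prefetch(&x[i + LINES_AHEAD * DOUBLES_PER_LINE]);
        sum += x[i];
    }
    return sum;
}

The matrix-multiplication code applies the same two rules to the rows of BT and CT, as described next.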
The matrix order of 32 is not a coincidence. A double-precision number consists of 8 bytes, and prefetch instructions bring memory into the processor in chunks called cache lines, each consisting of 64 bytes (eight double-precision numbers). A row of BT contains 32 doubles, or four cache lines, so we need to issue four prefetch instructions to prefetch a row of BT. Consequently, when multiplying all 32 rows of A with a particular column of B, we want to arrange the for-loop that cycles through the rows of A such that it is repeated four times, issuing one of the four prefetch instructions on each repetition. To achieve this, we need to dot eight rows of A with a row of BT every time we pass through the Ctr_row_num for-loop. Additionally, “dotting” eight rows of A with a row of BT produces eight doubles of CT (that is, a full cache line).
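
Putting these pieces together, one possible arrangement is sketched below. It is consistent with the figures above but is not the listing this section analyzes: the outer loop walks through the rows of BT and CT, and the inner loop repeats four times, each pass dotting eight rows of A with the current row of BT, writing one full cache line of CT, and prefetching one quarter of the next row of BT along with the corresponding cache line of the next row of CT. __builtin_prefetch again stands in for the PREFETCH and PREFETCHW instructions, and 64-byte alignment of the arrays is assumed.

#define ORDER 32

void matmul_bt_prefetch(const double a[ORDER][ORDER],
                        const double bt[ORDER][ORDER],
                        double ct[ORDER][ORDER])
{
    int Ctr_row_num, blk, i, k;

    for (Ctr_row_num = 0; Ctr_row_num < ORDER; Ctr_row_num++) {
        /* Wrap at the last row only to keep the prefetch addresses in
           bounds; the final prefetches are redundant but harmless.     */
        int next = (Ctr_row_num + 1) % ORDER;

        for (blk = 0; blk < ORDER; blk += 8) {       /* repeated four times */
            /* Prefetch one quarter (one cache line) of the next row of BT
               for reading and one cache line of the next row of CT for
               writing.                                                    */
            __builtin_prefetch(&bt[next][blk], 0);
            __builtin_prefetch(&ct[next][blk], 1);

            /* Dot eight rows of A with the current row of BT: 512
               floating-point operations producing eight contiguous
               doubles (one cache line) of CT.                        */
            for (i = 0; i < 8; i++) {
                double sum = 0.0;
                for (k = 0; k < ORDER; k++)
                    sum += a[blk + i][k] * bt[Ctr_row_num][k];
                ct[Ctr_row_num][blk + i] = sum;
            }
        }
    }
}

With this arrangement, every prefetch is issued at least one 512-operation pass before the data it requests is read or written, which, as the following paragraphs show, is more than enough to hide the memory latency.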
Assume it takes 60 ns to retrieve data from memory; then we must ensure that at least this much time elapses between issuing a prefetch instruction and the processor loading the prefetched data into its registers. The dot product of eight rows of A with a row of BT consists of 512 floating-point operations (dotting a single row of A with a row of BT requires 32 additions and 32 multiplications). The AMD Athlon, AMD Athlon 64, and AMD Opteron processors are capable of performing a maximum of two floating-point operations per clock cycle; therefore, it takes the processor no less than 256 clock cycles to process each pass through the Ctr_row_num for-loop.
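
Making the arithmetic explicit:

  one row of A with one row of BT:     32 multiplications + 32 additions = 64 floating-point operations
  eight rows of A with one row of BT:  8 × 64 = 512 floating-point operations
  minimum time at two operations per clock cycle:  512 / 2 = 256 clock cycles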
Choosing a matrix order of 32 is convenient for these reasons:
• All three matrices A, BT, and CT can fit into the processor’s 64-Kbyte L1 data cache.
• On a 2-GHz processor running at full floating-point utilization, 128 ns elapse during the 256 clock cycles, considerably more than the 60 ns needed to retrieve the data from memory, as the arithmetic below confirms.
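
As a quick check on both points (assuming 8-byte doubles throughout):

  cache footprint:        3 matrices × 32 × 32 elements × 8 bytes = 24,576 bytes (24 Kbytes), well under the 64-Kbyte L1 data cache
  time per pass at 2 GHz: 256 clock cycles × 0.5 ns per cycle = 128 ns, more than twice the assumed 60-ns memory latency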