To form each row of CT, the transpose of the matrix product C = A × B, the rows of matrix A are “dotted” with a row of BT, the transpose of B (that is, with a column of B). Once this is done, the rows of matrix A are “dotted” with the next row of BT, and the process is repeated through all the rows of BT.
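
This access pattern can be sketched in plain C before any prefetching is added. The sketch below is illustrative only: it assumes row-major 32 × 32 double-precision arrays a, bt, and ct holding A, BT, and CT, and the function and variable names are stand-ins rather than names from the example discussed in this section; only the counter Ctr_row_num is taken from the surrounding text.

#define ORDER 32    /* matrix order used throughout this example */

void matmul_bt(const double a[ORDER][ORDER],   /* matrix A                 */
               const double bt[ORDER][ORDER],  /* BT, the transpose of B   */
               double ct[ORDER][ORDER])        /* CT, the transpose of C   */
{
    int Ctr_row_num, a_row, k;

    /* Each row of CT is one column of B (one row of BT) "dotted" with
       every row of A.                                                  */
    for (Ctr_row_num = 0; Ctr_row_num < ORDER; Ctr_row_num++) {
        for (a_row = 0; a_row < ORDER; a_row++) {
            double sum = 0.0;
            for (k = 0; k < ORDER; k++)          /* 32 multiplies + 32 adds */
                sum += a[a_row][k] * bt[Ctr_row_num][k];
            ct[Ctr_row_num][a_row] = sum;        /* one element of CT       */
        }
    }
}

Because B is stored transposed, both operands of the inner loop are read sequentially with unit stride.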
From a performance standpoint, there are several caveats to recognize, as follows:
• Once all the rows of A have been multiplied with the first column of B, all the rows of A are in the cache, and subsequent accesses to them do not cause cache misses.
• The rows of BT are brought into the cache by “dotting” the first four rows of A with each row of BT in the Ctr_row_num for-loop.
• The elements of CT are not initially in the cache, and every time a new set of four rows of A is “dotted” with a new row of BT, the processor has to wait for CT to arrive in the cache before the results can be written.
You can address the last two caveats by prefetching to improve performance. However, to exploit prefetching efficiently, you must structure the code to issue the prefetch instructions such that:
• Enough time is provided for the memory requests sent out by the prefetch instructions to bring the data into the processor’s cache before the data is needed.
• The loops containing the prefetch instructions are arranged to issue enough prefetch instructions to fetch all the pertinent data (a generic sketch of both rules follows).
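
The two rules can be seen in isolation in the following generic sketch, which is not the matrix code itself. It streams through a large array and uses the GCC/Clang builtin __builtin_prefetch as a portable stand-in for the PREFETCH instruction; the look-ahead distance of eight cache lines is an illustrative assumption, not a figure from this guide.

#define DOUBLES_PER_LINE 8            /* 64-byte cache line / 8-byte double  */
#define LINES_AHEAD      8            /* how far ahead to prefetch (assumed) */

double sum_array(const double *x, long n)
{
    double sum = 0.0;
    for (long i = 0; i < n; i++) {
        /* One prefetch per cache line, so every pertinent line is covered,
           issued LINES_AHEAD lines early, so it has time to arrive.        */
        if (i % DOUBLES_PER_LINE == 0 &&
            i + LINES_AHEAD * DOUBLES_PER_LINE < n)
            __builtin_prefetch(&x[i + LINES_AHEAD * DOUBLES_PER_LINE]);
        sum += x[i];
    }
    return sum;
}

The matrix-multiplication code applies the same two rules to the rows of BT and CT, as described next.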
The matrix order of 32 is not a coincidence. A double-precision number consists of 8 bytes, and prefetch instructions bring memory into the processor in chunks called cache lines, each consisting of 64 bytes (eight double-precision numbers). A row of BT contains 32 doubles, or four cache lines, so we need to issue four prefetch instructions to prefetch a row of BT. Consequently, when multiplying all 32 rows of A with a particular column of B, we want to arrange the for-loop that cycles through the rows of A such that it is repeated four times, issuing one of the four prefetch instructions on each repetition. To achieve this, we need to dot eight rows of A with a row of BT every time we pass through the Ctr_row_num for-loop. Additionally, “dotting” eight rows of A with a row of BT produces eight doubles of CT (that is, a full cache line).
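
Putting these pieces together, one possible arrangement is sketched below. It is consistent with the figures above but is not the listing this section analyzes: the outer loop walks through the rows of BT and CT, and the inner loop repeats four times, each pass dotting eight rows of A with the current row of BT, writing one full cache line of CT, and prefetching one quarter of the next row of BT along with the corresponding cache line of the next row of CT. __builtin_prefetch again stands in for the PREFETCH and PREFETCHW instructions, and 64-byte alignment of the arrays is assumed.

#define ORDER 32

void matmul_bt_prefetch(const double a[ORDER][ORDER],
                        const double bt[ORDER][ORDER],
                        double ct[ORDER][ORDER])
{
    int Ctr_row_num, blk, i, k;

    for (Ctr_row_num = 0; Ctr_row_num < ORDER; Ctr_row_num++) {
        /* Wrap at the last row only to keep the prefetch addresses in
           bounds; the final prefetches are redundant but harmless.     */
        int next = (Ctr_row_num + 1) % ORDER;

        for (blk = 0; blk < ORDER; blk += 8) {       /* repeated four times */
            /* Prefetch one quarter (one cache line) of the next row of BT
               for reading and one cache line of the next row of CT for
               writing.                                                    */
            __builtin_prefetch(&bt[next][blk], 0);
            __builtin_prefetch(&ct[next][blk], 1);

            /* Dot eight rows of A with the current row of BT: 512
               floating-point operations producing eight contiguous
               doubles (one cache line) of CT.                        */
            for (i = 0; i < 8; i++) {
                double sum = 0.0;
                for (k = 0; k < ORDER; k++)
                    sum += a[blk + i][k] * bt[Ctr_row_num][k];
                ct[Ctr_row_num][blk + i] = sum;
            }
        }
    }
}

With this arrangement, every prefetch is issued at least one 512-operation pass before the data it requests is read or written, which, as the following paragraphs show, is more than enough to hide the memory latency.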
Assume it takes 60 ns to retrieve data from memory; then we must ensure that at least this much time elapses between issuing a prefetch instruction and the processor loading the prefetched data into its registers. The dot product of eight rows of A with a row of BT consists of 512 floating-point operations (dotting a single row of A with a row of BT requires 32 additions and 32 multiplications). The AMD Athlon, AMD Athlon 64, and AMD Opteron processors are capable of performing a maximum of two floating-point operations per clock cycle; therefore, it takes the processor no less than 256 clock cycles to process each pass through the Ctr_row_num for-loop.
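
Making the arithmetic explicit:

  one row of A with one row of BT:     32 multiplications + 32 additions = 64 floating-point operations
  eight rows of A with one row of BT:  8 × 64 = 512 floating-point operations
  minimum time at two operations per clock cycle:  512 / 2 = 256 clock cycles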
Choosing a matrix order of 32 is convenient for these reasons:
• All three matrices A, BT, and CT can fit into the processor’s 64-Kbyte L1 data cache.
• On a 2-GHz processor running at full floating-point utilization, 128 ns elapse during the 256 clock cycles, considerably more than the 60 ns needed to retrieve the data from memory, as the arithmetic below confirms.
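
As a quick check on both points (assuming 8-byte doubles throughout):

  cache footprint:        3 matrices × 32 × 32 elements × 8 bytes = 24,576 bytes (24 Kbytes), well under the 64-Kbyte L1 data cache
  time per pass at 2 GHz: 256 clock cycles × 0.5 ns per cycle = 128 ns, more than twice the assumed 60-ns memory latency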