202 Optimizing with SIMD Instructions Chapter 9
25112 Rev. 3.06 September 2005
Software Optimization Guide for AMD64 Processors
The size of each row is an integer number of cache lines.
A set of eight rows of A is dotted in pairs of four with B^T, and prefetches in each iteration of the
Ctr_row_num for-loop are issued to retrieve:
• The cache line (or set of eight double-precision values) of C^T to be processed in the next
  iteration of the Ctr_row_num for-loop.
• One quarter of the next row of B^T.
Including the prefetch to the rows of B^T increases performance by about 16%. Prefetching the
elements of C^T increases performance by an additional 3% or so.
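The prefetch pattern described above can be sketched in plain scalar C. This is a minimal illustration, not the guide's routine: the function name, the PREFETCH_HINT macro (which wraps the GCC/Clang builtin __builtin_prefetch and compiles to nothing elsewhere), and the loop variable names are assumptions made for the example, and no SIMD instructions are used. It assumes 32 x 32 matrices of doubles, so one row is four 64-byte cache lines and the inner block loop runs four times per row of B^T.

```c
#include <stddef.h>

/* Portable prefetch hint: an assumption for this sketch, not the guide's mechanism. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_HINT(p) __builtin_prefetch((const void *)(p))
#else
#define PREFETCH_HINT(p) ((void)(p))
#endif

enum {
    DIM = 32,          /* matrices are DIM x DIM doubles         */
    LINE_DOUBLES = 8   /* 8 doubles x 8 bytes = one 64-byte line */
};

/* Computes Ct = (A * B)^T, i.e. Ct[j][i] = dot(row i of A, row j of Bt). */
void matmul_bt_prefetch_sketch(const double *A, const double *Bt, double *Ct)
{
    for (int j = 0; j < DIM; ++j) {
        const double *b_row = Bt + (size_t)j * DIM;

        /* Four iterations per row of Bt; each handles eight rows of A. */
        for (int blk = 0; blk < DIM / LINE_DOUBLES; ++blk) {
            int first = blk * LINE_DOUBLES;
            int last_step = (j == DIM - 1) && (blk == DIM / LINE_DOUBLES - 1);

            /* One quarter (one cache line) of the next row of Bt. */
            if (j + 1 < DIM)
                PREFETCH_HINT(b_row + DIM + first);

            /* The cache line of Ct that the next iteration will write. */
            if (!last_step)
                PREFETCH_HINT(Ct + (size_t)j * DIM + first + LINE_DOUBLES);

            /* Eight dot products give the work that hides the prefetch latency. */
            for (int i = first; i < first + LINE_DOUBLES; ++i) {
                const double *a_row = A + (size_t)i * DIM;
                double sum = 0.0;
                for (int k = 0; k < DIM; ++k)
                    sum += a_row[k] * b_row[k];
                Ct[(size_t)j * DIM + i] = sum;
            }
        }
    }
}
```

Each prefetch is followed by eight 32-element dot products, which is the kind of spacing between prefetches that gives the data time to arrive. The guards on the last row and last block only keep the pointer arithmetic inside the arrays.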
Follow these guidelines when working with processor-limited loops:
• Arrange your code with enough instructions between prefetches so that there is adequate time for
  the data to be retrieved.
• Make sure the data that you are prefetching fits into the L1 data cache and does not displace other
  data that is also being operated upon. For instance, choosing a larger matrix size might displace A
  if all three matrices cannot fit into the 64-Kbyte L1 data cache.
• Operate on data in chunks that are integer multiples of cache lines.
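The capacity and cache-line guidelines can be checked with simple arithmetic. The sketch below is illustrative only: the helper names are hypothetical, the 64-Kbyte L1 size and 64-byte line size are taken from the text, and the check counts raw bytes only, ignoring cache associativity and any other data the loop touches.

```c
#include <stddef.h>

enum {
    L1_DATA_BYTES = 64 * 1024,  /* 64-Kbyte L1 data cache (from the text) */
    CACHE_LINE_BYTES = 64       /* 64-byte cache line (from the text)     */
};

/* Bytes occupied by three n x n matrices of doubles (A, B^T, C^T). */
size_t working_set_bytes(size_t n)
{
    return 3 * n * n * sizeof(double);
}

/* Nonzero if each row of n doubles is a whole number of cache lines. */
int rows_are_whole_lines(size_t n)
{
    return (n * sizeof(double)) % (size_t)CACHE_LINE_BYTES == 0;
}

/* Nonzero if the three matrices fit in L1 by raw capacity alone. */
int fits_in_l1(size_t n)
{
    return working_set_bytes(n) <= (size_t)L1_DATA_BYTES;
}
```

With 8-byte doubles, three 32 x 32 matrices occupy 24,576 bytes (24 Kbytes) and fit, while three 64 x 64 matrices occupy 98,304 bytes (96 Kbytes) and do not.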
Examples
Double-Precision 32 × 32 Matrix Multiplication
//*****************************************************************************
// This routine multiplies a 32x32 matrix A (stored in row-major format) upon
// the transpose of a 32x32 matrix B (stored in row-major format) to get
// the transpose of the resultant 32x32 matrix C.
//*******************************************************************************
void matrix_multiply_32x32(double *A,double *Btranspose,double *Ctranspose) {
int Ctr_8col_blck, Ctr_row_num, n;
// These 4 pointers are used to address 4 consecutive rows of matrix A.
double *Aptr0, *Aptr1, *Aptr2, *Aptr3;
// Pointers *Btr_ptr and *Ctr_ptr are used to address the column of B upon
// which A is being multiplied and where the result C is placed.
// Pointers *Btr_prefptr and *Ctr_prefptr are used to address the next column
// of B and the next elements of C to be calculated in advance
// using prefetch instructions.
double *Btr_ptr, *Ctr_ptr, *Btr_prefptr, *Ctr_prefptr;
// Put the address of matrices B-transpose and C-transpose into their
// respective temporary pointers.
Btr_ptr = Btranspose; Ctr_ptr = Ctranspose;
// Shift the prefetch pointers to the next row of B-transpose and the
// next set of 8 elements of C-transpose. (Each set of 8 doubles is
// a 64-byte cache line if the addresses Btr_ptr and Ctr_ptr are aligned
// in memory on 64-byte boundaries.)