• The size of each row is an integer number of cache lines.
A set of eight rows of A is dotted in pairs of four with B^T, and prefetches in each iteration of the Ctr_row_num for-loop are issued to retrieve:
• The cache line (or set of eight double-precision values) of C^T to be processed in the next iteration of the Ctr_row_num for-loop.
• One quarter of the next row of B^T.
Including the prefetch to the rows of B^T increases performance by about 16%. Prefetching the elements of C^T increases performance by an additional 3% or so.
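Conceptually, the two prefetches issued per iteration might look like the following sketch. The helper function and the quarter index are illustrative assumptions, not the guide's code; the full example below issues equivalent prefetch instructions through the pointers Btr_prefptr and Ctr_prefptr, which are advanced as the loops progress.

#include <xmmintrin.h>   // _mm_prefetch

// Minimal sketch of the prefetch pattern described above, assuming 32x32
// double-precision matrices and 64-byte cache lines. One row of B-transpose
// is 32 * 8 = 256 bytes (four cache lines), so one quarter of a row is
// exactly one cache line. The function name, the quarter argument, and the
// cache hint level are illustrative assumptions.
static void prefetch_next_iteration(const double *Btr_prefptr,
                                    const double *Ctr_prefptr,
                                    int quarter)   // 0..3: which quarter of the next row of B-transpose
{
    // The next cache line (eight doubles) of C-transpose to be computed.
    _mm_prefetch((const char *)Ctr_prefptr, _MM_HINT_T0);
    // One quarter (eight doubles) of the next row of B-transpose.
    _mm_prefetch((const char *)(Btr_prefptr + 8 * quarter), _MM_HINT_T0);
}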
Follow these guidelines when working with processor-limited loops:
• Arrange your code with enough instructions between prefetches so that there is adequate time for
the data to be retrieved.
• Make sure the data that you are prefetching fits into the L1 data cache and does not displace other data that is also being operated upon. For instance, the three 32 × 32 matrices of doubles in the following example occupy only 3 × 32 × 32 × 8 bytes = 24 Kbytes, but choosing a larger matrix size might displace A if all three matrices cannot fit into the 64-Kbyte L1 data cache.
• Operate on data in chunks that are integer multiples of cache lines. (A brief sketch illustrating these guidelines follows this list.)
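The following is a minimal sketch, not the guide's example, of these guidelines applied to a simple dot-product-style loop: each iteration consumes exactly one 64-byte cache line (eight doubles) of each array and issues prefetches two cache lines ahead, so the arithmetic between prefetches gives the prefetched data time to arrive. The function name, the requirement that n be a multiple of 8, and the prefetch distance are illustrative assumptions to be tuned for the actual loop.

#include <xmmintrin.h>   // _mm_prefetch

// Sum of products of two arrays of doubles whose length n is a multiple
// of 8 (one 64-byte cache line per iteration). The prefetch distance of
// two cache lines (16 doubles) is an assumption; choose it so the work
// between prefetches covers the memory latency. Prefetch instructions are
// hints and do not fault, so prefetching slightly past the end of the
// arrays on the final iterations is harmless.
double dot_lines(const double *a, const double *b, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i += 8) {
        _mm_prefetch((const char *)(a + i + 16), _MM_HINT_T0);
        _mm_prefetch((const char *)(b + i + 16), _MM_HINT_T0);
        for (int j = 0; j < 8; j++)   // one cache line of each array
            sum += a[i + j] * b[i + j];
    }
    return sum;
}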
Examples
Double-Precision 32 × 32 Matrix Multiplication
//*****************************************************************************
// This routine multiplies a 32x32 matrix A (stored in row-major format) by
// the transpose of a 32x32 matrix B (stored in row-major format) to produce
// the transpose of the resultant 32x32 matrix C.
//*****************************************************************************
void matrix_multiply_32x32(double *A,double *Btranspose,double *Ctranspose) {
int Ctr_8col_blck, Ctr_row_num, n;
// These 4 pointers are used to address 4 consecutive rows of matrix A.
double *Aptr0, *Aptr1, *Aptr2, *Aptr3;
// Pointers *Btr_ptr and *Ctr_ptr are used to address the column of B upon
// which A is being multiplied and where the result C is placed.
// Pointers *Btr_prefptr and *Ctr_prefptr are used to address the next column
// of B and the next elements of C to be calculated in advance
// using prefetch instructions.
double *Btr_ptr, *Ctr_ptr, *Btr_prefptr, *Ctr_prefptr;
// Put the address of matrices B-transpose and C-transpose into their
// respective temporary pointers.
Btr_ptr = Btranspose; Ctr_ptr = Ctranspose;
// Shift the prefetch pointers to the next row of B-transpose and the
// next set of 8 elements of C-transpose. (Each set of 8 doubles is
// a 64-byte cache line if the addresses Btr_ptr and Ctr_ptr are aligned
// in memory on 64-byte boundaries.)