200 Optimizing with SIMD Instructions Chapter 9
25112 Rev. 3.06 September 2005
Software Optimization Guide for AMD64 Processors
9.5 Structuring Code with Prefetch Instructions to Hide Memory Latency
Optimization
When utilizing prefetch instructions, attend to:
• The time allotted (latency) for data to reach the processor between issuing a prefetch instruction and using the data.
• Structuring the code to best take advantage of prefetching.
Application
This optimization applies to:
• 32-bit software
• 64-bit software
Rationale
Prefetch instructions bring the cache line containing a specified memory location into the processor
cache. (For more information on prefetch instructions, see “Prefetch Instructions” on page 104.)
Prefetching hides the main memory load latency, which is typically about two orders of magnitude larger
than a processor clock cycle (a 60-ns memory latency is 120 clock cycles at 2 GHz).
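As a rough illustration of the idea, the sketch below issues prefetches a fixed distance ahead of the data currently being processed in a dot-product loop, so that the memory latency overlaps with computation. It is not taken from this guide: __builtin_prefetch is a GCC/Clang compiler builtin (on x86 it typically compiles to a PREFETCH-family instruction), the names dot_prefetch and PREFETCH_AHEAD are hypothetical, and the 64-byte cache line and the prefetch distance are assumptions that would need tuning on real hardware.

```c
#include <stddef.h>

/* Hypothetical tuning constant: how many doubles ahead of the current
 * position to prefetch. 64 doubles = 512 bytes = 8 cache lines, assuming
 * 64-byte lines. The right value depends on the memory latency and on
 * how much work each iteration performs. */
#define PREFETCH_AHEAD 64

/* Dot product of two equal-length vectors, processed one assumed
 * 64-byte cache line (8 doubles) per outer iteration. */
double dot_prefetch(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        /* Request data that will be needed several iterations from now.
         * The guard keeps the prefetch addresses inside the arrays.
         * Arguments: address, 0 = read access, 3 = keep in all caches. */
        if (i + PREFETCH_AHEAD < n) {
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 3);
            __builtin_prefetch(&b[i + PREFETCH_AHEAD], 0, 3);
        }
        /* Consume the current cache line of each vector. */
        for (size_t k = 0; k < 8; ++k)
            sum += a[i + k] * b[i + k];
    }

    /* Handle the elements left over when n is not a multiple of 8. */
    for (; i < n; ++i)
        sum += a[i] * b[i];

    return sum;
}
```

One prefetch per vector per outer iteration is enough here because each prefetch brings in a whole cache line, which is exactly what one outer iteration consumes. A prefetch is only a hint, so the function returns the same result as a plain loop whether or not the hint takes effect.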
There are two types of loops:

Loop type           Description
Memory-limited      Data can be processed and requested faster than it can be fetched from memory.
Processor-limited   Data can be requested and brought into the processor before it is needed because
                    considerable processing occurs during each unrolled loop iteration.

The example below illustrates the importance of these considerations by multiplying a
double-precision 32 × 32 matrix A by another 32 × 32 transposed double-precision matrix, Bᵀ; the
result is returned in another 32 × 32 transposed double-precision matrix, Cᵀ. (The transposition of B
and C is performed to efficiently access their elements because matrices in the C programming
language are stored in row-major format. Doing the transposition in advance reduces the problem of
matrix multiplication to one of computing several dot-products, one for each element of the results
matrix, Cᵀ. This “dotting” operation is implemented as the sum of pair-wise products of the elements
of two equal-length vectors.) For this example, assume the processor clock speed is 2 GHz, and the
memory latency is 60 ns. In this example, the rows of matrix A are repeatedly