Chapter 5 Cache and Memory Optimizations 107
Software Optimization Guide for AMD64 Processors
25112 Rev. 3.06 September 2005
For more information on write-combining, see Appendix B, “Implementation of Write-Combining.”
Multiple Prefetches
Programmers can initiate multiple outstanding prefetches on the AMD Athlon 64 and AMD Opteron
processors. These processors can have a theoretical maximum of eight outstanding prefetches,
but in practice the number is usually smaller. When all resources are filled by pending
memory read requests, the processor waits until resources become free before processing the
next request. Because multiple prefetch requests are essentially handled in order, issue
prefetches in the order in which the data is needed.
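As a rough C-level illustration of the same idea, the sketch below issues one prefetch per array, in the order the arrays are touched, before working through the current cache line. It is only a sketch under stated assumptions: `__builtin_prefetch` is a GCC/Clang builtin rather than anything defined by this guide, the helper name `multiply_with_prefetch` and the macro names are hypothetical, the 64-byte line size and four-line distance mirror the assembly example in this section, and the element count is assumed to be a multiple of eight.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values: 64-byte cache lines, prefetch four lines ahead. */
#define CACHE_LINE_BYTES   64
#define DOUBLES_PER_LINE   (CACHE_LINE_BYTES / sizeof(double))  /* 8 doubles  */
#define LINES_AHEAD        4
#define PREFETCH_DISTANCE  (LINES_AHEAD * DOUBLES_PER_LINE)     /* 32 doubles */

/* Hypothetical helper: a[i] = b[i] * c[i], one cache line per outer pass. */
void multiply_with_prefetch(double *a, const double *b, const double *c,
                            size_t n)
{
    /* The outer loop steps by whole cache lines. */
    assert(n % DOUBLES_PER_LINE == 0);

    for (size_t i = 0; i < n; i += DOUBLES_PER_LINE) {
        /* Skip the prefetches near the end so that no address past the
           arrays is ever formed. */
        if (i + PREFETCH_DISTANCE < n) {
            /* Second argument: 1 = data will be written, 0 = read only. */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 1);
            __builtin_prefetch(&b[i + PREFETCH_DISTANCE], 0);
            __builtin_prefetch(&c[i + PREFETCH_DISTANCE], 0);
        }

        /* Process the eight doubles of the current cache line. */
        for (size_t j = 0; j < DOUBLES_PER_LINE; j++) {
            a[i + j] = b[i + j] * c[i + j];
        }
    }
}
```

Three prefetches per pass stay well below the theoretical maximum of eight outstanding requests. Whether a compiler maps the write hint to PREFETCHW depends on the target options, so treat the distance and the hints as starting points to measure rather than fixed values.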
The following example shows how to initiate multiple prefetches when traversing more than one
array.
Example—Multiple Prefetches
.CODE
.K3D
.686
; Original C code:
;
; #define LARGE_NUM 65536
; #define ARR_SIZE (LARGE_NUM*8)
;
; double array_a[LARGE_NUM];
; double array_b[LARGE_NUM];
; double array_c[LARGE_NUM];
; int i;
;
; for (i = 0; i < LARGE_NUM; i++) {
;     array_a[i] = array_b[i] * array_c[i];
; }
mov edx, (-LARGE_NUM) ; Use biased index.
mov eax, OFFSET array_a ; Get address of array_a.
mov ebx, OFFSET array_b ; Get address of array_b.
mov ecx, OFFSET array_c ; Get address of array_c.
loop:
prefetchw [eax+256] ; Four cache lines ahead
prefetch [ebx+256] ; Four cache lines ahead
prefetch [ecx+256] ; Four cache lines ahead
fld QWORD PTR [ebx+edx*8+ARR_SIZE] ; b[i]
fmul QWORD PTR [ecx+edx*8+ARR_SIZE] ; b[i] * c[i]
fstp QWORD PTR [eax+edx*8+ARR_SIZE] ; a[i] = b[i] * c[i]
fld QWORD PTR [ebx+edx*8+ARR_SIZE+8] ; b[i+1]
fmul QWORD PTR [ecx+edx*8+ARR_SIZE+8] ; b[i+1] * c[i+1]
fstp QWORD PTR [eax+edx*8+ARR_SIZE+8] ; a[i+1] = b[i+1] * c[i+1]
fld QWORD PTR [ebx+edx*8+ARR_SIZE+16] ; b[i+2]
fmul QWORD PTR [ecx+edx*8+ARR_SIZE+16] ; b[i+2] * c[i+2]