fstp QWORD PTR [eax+edx*8+ARR_SIZE+16] ; a[i+2] = b[i+2] * c[i+2]
fld QWORD PTR [ebx+edx*8+ARR_SIZE+24] ; b[i+3]
fmul QWORD PTR [ecx+edx*8+ARR_SIZE+24] ; b[i+3] * c[i+3]
fstp QWORD PTR [eax+edx*8+ARR_SIZE+24] ; a[i+3] = b[i+3] * c[i+3]
fld QWORD PTR [ebx+edx*8+ARR_SIZE+32] ; b[i+4]
fmul QWORD PTR [ecx+edx*8+ARR_SIZE+32] ; b[i+4] * c[i+4]
fstp QWORD PTR [eax+edx*8+ARR_SIZE+32] ; a[i+4] = b[i+4] * c[i+4]
fld QWORD PTR [ebx+edx*8+ARR_SIZE+40] ; b[i+5]
fmul QWORD PTR [ecx+edx*8+ARR_SIZE+40] ; b[i+5] * c[i+5]
fstp QWORD PTR [eax+edx*8+ARR_SIZE+40] ; a[i+5] = b[i+5] * c[i+5]
fld QWORD PTR [ebx+edx*8+ARR_SIZE+48] ; b[i+6]
fmul QWORD PTR [ecx+edx*8+ARR_SIZE+48] ; b[i+6] * c[i+6]
fstp QWORD PTR [eax+edx*8+ARR_SIZE+48] ; a[i+6] = b[i+6] * c[i+6]
fld QWORD PTR [ebx+edx*8+ARR_SIZE+56] ; b[i+7]
fmul QWORD PTR [ecx+edx*8+ARR_SIZE+56] ; b[i+7] * c[i+7]
fstp QWORD PTR [eax+edx*8+ARR_SIZE+56] ; a[i+7] = b[i+7] * c[i+7]
add edx, 8 ; Compute next 8 products
jnz loop ; until none left.
END
The following optimization rules are applied to this example (a C-level sketch of the resulting loop structure follows the list):
• Partially unroll loops to ensure that the data stride per loop iteration is equal to the length of a
cache line. This avoids overlapping PREFETCH instructions and thus makes optimal use of the
available number of outstanding prefetches.
• Because the array array_a is written but never read, use PREFETCHW instead of PREFETCH to avoid the overhead of switching cache lines to the correct state. The prefetch distance is chosen so that each loop iteration works on three cache lines (one from each of the three arrays) while active prefetches bring in the next cache lines.
• Reduce index arithmetic to a minimum by using complex addressing modes and by biasing the array base addresses, cutting down on loop overhead.
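The same structure can be sketched at the C level using compiler prefetch intrinsics. The following code is illustrative rather than part of the original example: it assumes a 64-byte cache line and a prefetch distance of four lines, the function name multiply_arrays is hypothetical, and _m_prefetchw (which emits PREFETCHW) requires a compiler and target that support it, for example -mprfchw with GCC or Clang.

#include <stddef.h>      /* size_t */
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */
#include <x86intrin.h>   /* _m_prefetchw (emits PREFETCHW) */

#define LINE_DOUBLES 8                   /* 64-byte cache line holds 8 doubles */
#define PF_DIST      (4 * LINE_DOUBLES)  /* prefetch four cache lines ahead    */

/* Hypothetical C equivalent of the unrolled loop: each iteration advances by
   exactly one cache line per array, so one prefetch per array is enough. */
void multiply_arrays(double *a, const double *b, const double *c, size_t n)
{
    for (size_t i = 0; i + LINE_DOUBLES <= n; i += LINE_DOUBLES) {
        _m_prefetchw(&a[i + PF_DIST]);                             /* written */
        _mm_prefetch((const char *)&b[i + PF_DIST], _MM_HINT_T0);  /* read    */
        _mm_prefetch((const char *)&c[i + PF_DIST], _MM_HINT_T0);  /* read    */
        for (size_t j = 0; j < LINE_DOUBLES; j++)  /* unrolled by hand in asm */
            a[i + j] = b[i + j] * c[i + j];
    }
    /* Prefetches past the end of the arrays do not fault; if n is not a
       multiple of 8, the remaining elements need a scalar tail loop. */
}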
Determining Prefetch Distance
When determining how far ahead to prefetch, the basic guideline is to initiate the prefetch early enough that the data is in the cache by the time it is needed, under the constraint that no more than eight prefetches can be in flight at any given time.
To determine the optimal prefetch distance, use empirical benchmarking when possible. Prefetching three or four cache lines ahead (192 or 256 bytes) is a good starting point that usually gives good results. Prefetching too far ahead impairs performance, because data brought in early may be evicted from the cache before it is used.
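As a concrete illustration (the constant names are invented for this sketch, not taken from the guide), the distance can be kept in a single tunable constant expressed in array elements, which makes it easy to benchmark different values:

/* Illustrative only: derive the prefetch distance in elements from the
   "three or four cache lines ahead" starting point, then benchmark a few
   values of LINES_AHEAD around it to find the optimum for the target. */
enum { CACHE_LINE_BYTES = 64, LINES_AHEAD = 4 };
#define PF_DIST_DOUBLES ((LINES_AHEAD * CACHE_LINE_BYTES) / sizeof(double))
/* 4 lines x 64 bytes = 256 bytes = 32 doubles ahead of the current index. */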
Memory-Limited versus Processor-Limited Code
Software prefetching can help hide memory latency, but it cannot increase the total memory bandwidth. Many loops are limited by memory bandwidth rather than processor speed, as shown in
Figure 4. In these cases, the best that software prefetching can do is to ensure that enough memory