fstp QWORD PTR [eax+edx*8+ARR_SIZE+16] ; a[i+2] = b[i+2] * c[i+2]
fld QWORD PTR [ebx+edx*8+ARR_SIZE+24] ; b[i+3]
fmul QWORD PTR [ecx+edx*8+ARR_SIZE+24] ; b[i+3] * c[i+3]
fstp QWORD PTR [eax+edx*8+ARR_SIZE+24] ; a[i+3] = b[i+3] * c[i+3]
fld QWORD PTR [ebx+edx*8+ARR_SIZE+32] ; b[i+4]
fmul QWORD PTR [ecx+edx*8+ARR_SIZE+32] ; b[i+4] * c[i+4]
fstp QWORD PTR [eax+edx*8+ARR_SIZE+32] ; a[i+4] = b[i+4] * c[i+4]
fld QWORD PTR [ebx+edx*8+ARR_SIZE+40] ; b[i+5]
fmul QWORD PTR [ecx+edx*8+ARR_SIZE+40] ; b[i+5] * c[i+5]
fstp QWORD PTR [eax+edx*8+ARR_SIZE+40] ; a[i+5] = b[i+5] * c[i+5]
fld QWORD PTR [ebx+edx*8+ARR_SIZE+48] ; b[i+6]
fmul QWORD PTR [ecx+edx*8+ARR_SIZE+48] ; b[i+6] * c[i+6]
fstp QWORD PTR [eax+edx*8+ARR_SIZE+48] ; a[i+6] = b[i+6] * c[i+6]
fld QWORD PTR [ebx+edx*8+ARR_SIZE+56] ; b[i+7]
fmul QWORD PTR [ecx+edx*8+ARR_SIZE+56] ; b[i+7] * c[i+7]
fstp QWORD PTR [eax+edx*8+ARR_SIZE+56] ; a[i+7] = b[i+7] * c[i+7]
add edx, 8 ; Compute next 8 products
jnz loop ; until none left.
END
The following optimization rules are applied to this example (a C-level sketch of the resulting loop structure follows the list):
• Partially unroll loops to ensure that the data stride per loop iteration is equal to the length of a
cache line. This avoids overlapping PREFETCH instructions and thus makes optimal use of the
available number of outstanding prefetches.
• Because the array array_a is written but never read, use PREFETCHW instead of PREFETCH to avoid the overhead of switching cache lines to the correct state. The prefetch distance is chosen so that each loop iteration works on three cache lines (one from each of the three arrays) while active prefetches bring in the next cache lines.
• Reduce index arithmetic to a minimum by using complex addressing modes and by biasing the array base addresses, cutting down on loop overhead.
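The same structure can be sketched at the C level using compiler prefetch intrinsics. The following code is illustrative rather than part of the original example: it assumes a 64-byte cache line and a prefetch distance of four lines, the function name multiply_arrays is hypothetical, and _m_prefetchw (which emits PREFETCHW) requires a compiler and target that support it, for example -mprfchw with GCC or Clang.

#include <stddef.h>      /* size_t */
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */
#include <x86intrin.h>   /* _m_prefetchw (emits PREFETCHW) */

#define LINE_DOUBLES 8                   /* 64-byte cache line holds 8 doubles */
#define PF_DIST      (4 * LINE_DOUBLES)  /* prefetch four cache lines ahead    */

/* Hypothetical C equivalent of the unrolled loop: each iteration advances by
   exactly one cache line per array, so one prefetch per array is enough. */
void multiply_arrays(double *a, const double *b, const double *c, size_t n)
{
    for (size_t i = 0; i + LINE_DOUBLES <= n; i += LINE_DOUBLES) {
        _m_prefetchw(&a[i + PF_DIST]);                             /* written */
        _mm_prefetch((const char *)&b[i + PF_DIST], _MM_HINT_T0);  /* read    */
        _mm_prefetch((const char *)&c[i + PF_DIST], _MM_HINT_T0);  /* read    */
        for (size_t j = 0; j < LINE_DOUBLES; j++)  /* unrolled by hand in asm */
            a[i + j] = b[i + j] * c[i + j];
    }
    /* Prefetches past the end of the arrays do not fault; if n is not a
       multiple of 8, the remaining elements need a scalar tail loop. */
}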
Determining Prefetch Distance
When determining how far ahead to prefetch, the basic guideline is to initiate the prefetch early enough that the data is in the cache by the time it is needed, under the constraint that no more than eight prefetches can be in flight at any given time.
To determine the optimal prefetch distance, use empirical benchmarking when possible. Prefetching three or four cache lines ahead (192 or 256 bytes) is a good starting point that usually gives good results. Prefetching too far ahead impairs performance, because data brought in early may be evicted from the cache before it is used.
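As a concrete illustration (the constant names are invented for this sketch, not taken from the guide), the distance can be kept in a single tunable constant expressed in array elements, which makes it easy to benchmark different values:

/* Illustrative only: derive the prefetch distance in elements from the
   "three or four cache lines ahead" starting point, then benchmark a few
   values of LINES_AHEAD around it to find the optimum for the target. */
enum { CACHE_LINE_BYTES = 64, LINES_AHEAD = 4 };
#define PF_DIST_DOUBLES ((LINES_AHEAD * CACHE_LINE_BYTES) / sizeof(double))
/* 4 lines x 64 bytes = 256 bytes = 32 doubles ahead of the current index. */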
Memory-Limited versus Processor-Limited Code
Software prefetching can help hide memory latency, but it cannot increase the total memory bandwidth. Many loops are limited by memory bandwidth rather than processor speed, as shown in
Figure 4. In these cases, the best that software prefetching can do is to ensure that enough memory