AMD 250 Computer Hardware User Manual


 
110 Cache and Memory Optimizations Chapter 5
25112 Rev. 3.06 September 2005
Software Optimization Guide for AMD64 Processors
Definitions
Unit-stride access refers to a memory access pattern where consecutive memory accesses are made to
consecutive array elements, in ascending or descending order. If the array elements are of an
elementary type, consecutive elements also occupy adjacent memory locations. For example:
int i;
char j = 0, k[MAX];
for (i = 0; i < MAX; i++) {
...
j += k[i]; // Every byte is used.
...
}
double x = 0.0, y[MAX];
for (i = 0; i < MAX; i++) {
...
x += y[i]; // Every byte is used.
...
}
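The definition above can be contrasted with a strided walk in a short C sketch. The function names and the step parameter are illustrative assumptions, not part of the original guide. The first two functions are unit-stride (ascending and descending); the third is unit-stride only when step is 1.

```c
#include <stddef.h>

/* Unit stride, ascending: iteration i reads a[i], the next reads a[i + 1]. */
double sum_ascending(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unit stride, descending: still consecutive elements, in reverse order. */
double sum_descending(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = n; i-- > 0; )
        s += a[i];
    return s;
}

/* Not unit stride when step > 1: the elements between the visited ones are
   skipped, so some of the bytes brought into the cache are never used.
   step must be nonzero. */
double sum_strided(const double *a, size_t n, size_t step)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i += step)
        s += a[i];
    return s;
}
```

With an 8-element array of doubles, sum_strided(a, 8, 4) reads only a[0] and a[4], which are 32 bytes apart.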
Exception to Unit Stride
The unit-stride concept works well when stepping through arrays of elementary data types. In some
instances, unit stride alone may not be sufficient to determine how to use the PREFETCH instruction
properly. For example, assume that there is a vertex structure of 256 bytes and the code steps through
the vertices in unit stride, but uses only the x, y, z, w components, each being of type float (for
example, the first 16 bytes of each vertex). Consecutive accesses are then 256 bytes apart, not 16.
In this case, the prefetch distance should be some function of the size of the data structure (for a
properly chosen n):
prefetch [eax+n*structure_size]
...
add eax, structure_size
You should experiment to find the optimal prefetch distance; there is no formula that works for all
situations.
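The vertex case described above might look as follows in C. This is a minimal sketch under stated assumptions: the Vertex layout, the sum_components function, and the PREFETCH_AHEAD value are hypothetical, and __builtin_prefetch is a GCC/Clang builtin used here in place of a hand-written PREFETCH instruction. PREFETCH_AHEAD plays the role of n and is a placeholder to be tuned by measurement.

```c
#include <stddef.h>

/* Hypothetical 256-byte vertex: the loop reads only the first 16 bytes. */
typedef struct {
    float x, y, z, w;
    unsigned char other[240];   /* attributes the loop never touches */
} Vertex;

_Static_assert(sizeof(Vertex) == 256, "Vertex is expected to be 256 bytes");

/* The "n" of the text: how many structures ahead to prefetch.
   Placeholder value; find a good one by experiment. */
#define PREFETCH_AHEAD 4

float sum_components(const Vertex *v, size_t count)
{
    float s = 0.0f;
    for (size_t i = 0; i < count; i++) {
        /* Prefetch distance is PREFETCH_AHEAD * sizeof(Vertex) bytes,
           a function of the structure size rather than of the 16 bytes
           actually read per iteration. */
        if (i + PREFETCH_AHEAD < count)
            __builtin_prefetch(&v[i + PREFETCH_AHEAD], 0, 3);
        s += v[i].x + v[i].y + v[i].z + v[i].w;
    }
    return s;
}
```

The bounds check keeps the address computation inside the array; it is a choice made for this sketch, not a requirement stated in the guide.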
Data Stride per Loop Iteration
Assuming unit-stride access to a single array, the data stride of a loop (the loop stride) refers to the
number of bytes accessed in the array per loop iteration. For example:
fldz
add_loop:
fadd QWORD PTR [ebx*8+base_address]
dec ebx
jnz add_loop
The data stride of the above loop is eight bytes. In general, for optimal use of prefetching, the data
stride per iteration should equal the length of a cache line (64 bytes in the AMD Athlon 64 and AMD
Opteron processors). If the loop stride is smaller, unroll the loop enough to use a whole cache line of data per