Chapter 5 Cache and Memory Optimizations 105
Software Optimization Guide for AMD64 Processors
25112 Rev. 3.06 September 2005
instructions can improve performance. Prefetch instructions only update the L1 data cache and do not
update an architectural register, so a prefetch uses one fewer register than a load instruction.
Unit-Stride Access
Large data sets typically require unit-stride access to ensure that all data pulled in by a prefetch
instruction is actually used. With unit-stride access, the code uses all of the data that is read from
memory, rather than only a sparse subset of it. If necessary, reorganize algorithms or data
structures to allow unit-stride access. For a definition of unit-stride access, see “Definitions” on
page 110.
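As an illustration of reorganizing a data structure for unit-stride access, the following C sketch contrasts a loop that reads one field out of each record with a loop over the same values stored contiguously. The struct particle layout, its field names, and both function names are hypothetical and chosen only for this example; the sketch assumes a 4-byte float, which makes each record 64 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record layout: only `weight` is read in the hot loop. */
struct particle {
    float weight;
    float other[15];   /* fields the loop never touches */
};

/* Strided access: each iteration reads one float out of a 16-float record,
 * so most of the data brought in with each record goes unused. */
float sum_weights_strided(const struct particle *p, size_t n)
{
    assert(p != NULL || n == 0);
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += p[i].weight;
    return sum;
}

/* Unit-stride access: the weights are stored contiguously, so every
 * element of each fetched cache line is consumed. */
float sum_weights_unit_stride(const float *weight, size_t n)
{
    assert(weight != NULL || n == 0);
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += weight[i];
    return sum;
}
```

Both functions compute the same result. Whether the reorganization pays off depends on how often the other fields are needed together with weight, so it should be confirmed by measurement.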
Hardware Prefetching
The AMD Athlon 64 and AMD Opteron processors implement a hardware prefetching mechanism.
The prefetched data is loaded into the L2 cache. The hardware prefetcher works most efficiently when
data is accessed on a cache-line-by-cache-line basis (that is, without skipping cache lines). Cache
lines on current AMD Athlon 64 and AMD Opteron processors are 64 bytes, but cache-line size is
implementation dependent.
The hardware prefetcher prefetches data that is accessed in an ascending or descending order on a
cache-line-by-cache-line basis. For example, when the hardware prefetcher detects an access to cache
line l followed by an access to cache line l + 1, it initiates a prefetch of cache line l + 3. Accessing
data in increments larger than 64 bytes may fail to trigger the hardware prefetcher because cache lines
are skipped. In these cases, use software-prefetch instructions. Note that in some
earlier revisions of the AMD Athlon 64 and AMD Opteron processors, the hardware prefetcher
detected only ascending accesses.
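The case of increments larger than 64 bytes can be sketched in C as follows. This is a minimal example, not a tuned implementation: it assumes a GCC- or Clang-compatible compiler, whose __builtin_prefetch builtin typically compiles to one of the prefetch instructions on x86. The function name sum_strided and the PREFETCH_AHEAD distance are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_AHEAD 8   /* hypothetical distance, in strides; tune by measurement */

/* Sums every `stride`-th element of a[0..n-1]. When stride * sizeof(double)
 * exceeds the cache-line size, cache lines are skipped, so a software
 * prefetch is issued for the element PREFETCH_AHEAD strides ahead. */
double sum_strided(const double *a, size_t n, size_t stride)
{
    assert(stride > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i += stride) {
        size_t ahead = i + PREFETCH_AHEAD * stride;
        if (ahead < n)
            __builtin_prefetch(&a[ahead], 0, 3);  /* 0 = read, 3 = keep in all cache levels */
        sum += a[i];
    }
    return sum;
}
```

With 8-byte doubles, a stride of 16 elements advances 128 bytes per iteration, which skips every other 64-byte cache line. The right prefetch distance depends on memory latency and on the work done per iteration.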
In some cases, using prefetch instructions on processors with hardware prefetching may slightly
reduce performance. In these cases, it may be necessary to remove the prefetch instructions. All
current AMD Athlon 64 and AMD Opteron processors have hardware prefetching mechanisms.
PREFETCH/W versus PREFETCHNTA/T0/T1/T2
PREFETCHNTA, PREFETCHT0, PREFETCHT1, and PREFETCHT2 are SSE instructions whose behavior is
processor-implementation dependent. For the AMD Athlon 64 and AMD Opteron processors, data
that is prefetched with the PREFETCHNTA instruction is not placed into the L2 cache when it is
evicted from the L1 data cache, unless it was originally in L2 when prefetched.
PREFETCHNTA is intended for non-temporal data that will not be needed again soon.
PREFETCHNTA should also be used when reading arrays that are larger than the L2 cache.
Such arrays cannot remain in L2 until they are needed again, and feeding them through the
L2 cache evicts other, possibly useful, data from L2.
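A single pass over such an array might look like the following C sketch. It assumes a GCC- or Clang-compatible compiler, where a locality argument of 0 to __builtin_prefetch requests a non-temporal prefetch (typically PREFETCHNTA on x86). The function name checksum_stream and the LINE_BYTES and AHEAD_LINES constants are hypothetical; the 64-byte line size matches the processors discussed here but is implementation dependent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES  64   /* assumed cache-line size; implementation dependent */
#define AHEAD_LINES 4    /* hypothetical prefetch distance; tune by measurement */

/* Adds up all bytes of a buffer that is read once, front to back.
 * The buffer is assumed to be larger than the L2 cache. */
uint64_t checksum_stream(const uint8_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    uint64_t sum = 0;
    for (size_t base = 0; base < len; base += LINE_BYTES) {
        size_t ahead = base + (size_t)AHEAD_LINES * LINE_BYTES;
        if (ahead < len)
            __builtin_prefetch(buf + ahead, 0, 0);  /* 0 = read, 0 = non-temporal hint */
        size_t end = (len - base < LINE_BYTES) ? len : base + LINE_BYTES;
        for (size_t i = base; i < end; i++)
            sum += buf[i];
    }
    return sum;
}
```

One prefetch is issued per cache line rather than per byte, because repeated prefetches of the same line add instruction overhead without bringing in new data.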
Note: The L2 cache size of the processor can be determined by using the CPUID instruction.
Chapters 5 and 9 show examples of how to use the PREFETCHNTA instruction.
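The note above on determining the L2 cache size with CPUID can be sketched in C as follows. The sketch assumes a GCC- or Clang-compatible compiler that provides the cpuid.h helper header, and it relies on CPUID function 8000_0006h, which reports the L2 size in KB in ECX[31:16] and the L2 line size in bytes in ECX[7:0]. All function names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header providing __get_cpuid */
#endif

/* Decode fields of ECX as returned by CPUID function 8000_0006h. */
unsigned l2_size_kb_from_ecx(uint32_t ecx)    { return ecx >> 16; }
unsigned l2_line_bytes_from_ecx(uint32_t ecx) { return ecx & 0xFFu; }

/* Returns the L2 size in KB, or 0 if it cannot be determined
 * (non-x86 build, or the CPUID function is not supported). */
unsigned query_l2_size_kb(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;
    if (__get_cpuid(0x80000006u, &eax, &ebx, &ecx, &edx))
        return l2_size_kb_from_ecx(ecx);
#endif
    return 0;
}

/* Returns 1 if an array of `bytes` bytes is larger than an L2 cache of
 * `l2_kb` KB, the condition under which PREFETCHNTA is suggested above. */
int exceeds_l2(size_t bytes, unsigned l2_kb)
{
    assert(l2_kb != 0);   /* 0 means the size is unknown */
    return bytes > (size_t)l2_kb * 1024;
}
```

For example, an ECX value of 0x02006140 decodes to a 512-KB L2 cache with 64-byte lines. A caller would check that query_l2_size_kb() returned a nonzero value before passing it to exceeds_l2().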