AMD 250 Computer Hardware User Manual


 
104 Cache and Memory Optimizations Chapter 5
25112 Rev. 3.06 September 2005
Software Optimization Guide for AMD64 Processors
5.6 Prefetch Instructions
Optimization
Where appropriate, use one of the prefetch instructions to increase the effective bandwidth of the
AMD Athlon 64 and AMD Opteron processors.
Application
This optimization applies to:
32-bit software
64-bit software
Rationale
Prefetch instructions take advantage of the high bus bandwidth of the AMD Athlon 64 and
AMD Opteron processors to hide latencies when fetching data from system memory. A prefetch
instruction initiates a read request of a specified address and reads the entire cache line that contains
that address.
AMD Athlon 64 and AMD Opteron processors perform three types of prefetches:
The prefetch instructions can be used anywhere, in any type of code. The use of prefetch instructions
is not affected by the values of Control Register 0 (CR0) bits, such as CR0.EM and CR0.TS.
Prefetching versus Preloading
In code that makes irregular memory accesses rather than sequential accesses, an ordinary MOV
instruction is the best way to load data. But in situations where sequential addresses are read, prefetch
Prefetch type Description
Load Reads the data into the L1 data cache; the data is later evicted to the L2 cache. The
following instructions perform load prefetches: PREFETCH, PREFETCHT0,
PREFETCHT1, and PREFETCHT2.
Store Reads the data into the L1 data cache and marks the data as modified; the data is
later evicted to the L2 cache. The PREFETCHW instruction performs a store prefetch.
Nontemporal The PREFETCHNTA instruction performs a nontemporal prefetch. The data is read
into the L1 data cache; to avoid cache pollution, when a PREFETCHNTA misses in
the L2 cache and reads from memory, the data is never evicted to the L2 cache. When
a PREFETCHNTA hits in the L2 cache, the data is evicted back to the L2 cache. AMD
Athlon 64 and AMD Opteron processors prior to Revision E read data into one way of
the L1 cache when the PREFETCHNTA instruction was used. Revision E processors
read PREFETCHNTA data into both ways of the L1 cache.