Chapter 5 Cache and Memory Optimizations 109
Software Optimization Guide for AMD64 Processors
25112 Rev. 3.06 September 2005
requests are “in flight” to keep the memory system busy all of the time. The AMD Athlon 64 and
AMD Opteron processors support a maximum of eight concurrent memory requests to different cache
lines. Multiple requests to the same cache line count as only one towards this limit of eight.
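The counting rule above can be sketched in C. This is only an illustration of how requests map onto cache lines, assuming the 64-byte line size used in this section's figures; the helper names distinct_cache_lines and fits_in_flight_limit are hypothetical and not part of any AMD interface.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64   /* line size used in this section's figures */
#define MAX_IN_FLIGHT    8u   /* concurrent-request limit stated above */

/* Hypothetical helper: count how many distinct cache lines a set of
   addresses touches. Addresses in the same 64-byte line count once.
   Quadratic scan; adequate for the few addresses in one loop body. */
static size_t distinct_cache_lines(const uintptr_t *addrs, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        uintptr_t line = addrs[i] / CACHE_LINE_BYTES;
        int seen = 0;
        for (size_t j = 0; j < i; j++) {
            if (addrs[j] / CACHE_LINE_BYTES == line) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            count++;
    }
    return count;
}

/* Hypothetical helper: returns 1 if the addresses touch no more than
   eight distinct cache lines, 0 otherwise. */
static int fits_in_flight_limit(const uintptr_t *addrs, size_t n)
{
    return distinct_cache_lines(addrs, n) <= MAX_IN_FLIGHT;
}
```

Under this rule, byte offsets 0, 8, and 63 fall in one line, so they would count as a single request towards the limit.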
Figure 4. Memory-Limited Code
Code that performs many computations on each cache line is limited by processor speed rather than
memory bandwidth, as shown in Figure 5. In this case, the goal of software prefetching is just to
ensure that the memory data is available when the processor needs it. As the processor speed
increases, the optimal prefetch distance increases until the memory bandwidth becomes the limiting
factor.
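One rough way to relate the prefetch distance to the quantities labeled in Figures 4 and 5 is to divide the total memory latency by the time spent per cache line and round up. The sketch below assumes the slower of the two per-line times (memory burst time or CPU time) sets the loop pace. The function name, the parameters, and this formula are illustrative assumptions, not a rule given by this guide.

```c
#include <stddef.h>

/* Hypothetical helper: estimate the prefetch distance in cache lines.
   latency    - total memory latency for one cache line (any time unit)
   burst_time - memory burst time for one cache line (same unit)
   cpu_time   - CPU time to process one cache line (same unit)
   Assumption: the slower of burst_time and cpu_time paces the loop. */
static size_t prefetch_distance_lines(size_t latency,
                                      size_t burst_time,
                                      size_t cpu_time)
{
    size_t per_line = cpu_time > burst_time ? cpu_time : burst_time;
    if (per_line == 0)
        return 0;                               /* guard against bad input */
    return (latency + per_line - 1) / per_line; /* round up */
}
```

With made-up values of 200 time units of latency and a 50-unit burst time, a loop that spends 100 units per line yields a distance of 2, while a loop that spends only 10 units per line yields 4. Once the CPU time drops below the burst time, the estimate stops growing, which matches the behavior described above.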
For an example of how to use software prefetching in processor-limited code, see “Structuring Code
with Prefetch Instructions to Hide Memory Latency” on page 200.
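As a minimal sketch of the idea (not the example from the referenced section), the C loop below processes one 64-byte cache line per outer iteration and prefetches a few lines ahead. It assumes a GCC or Clang toolchain, whose __builtin_prefetch builtin with a locality argument of 0 is typically lowered to prefetchnta on x86. The function name and the summation workload are hypothetical stand-ins.

```c
#include <stddef.h>

#define CACHE_LINE_BYTES 64
#define PREFETCH_LINES   3   /* ~2 lines ahead as in Figure 5, 3 for safety */

/* Hypothetical example: sum an array of floats while prefetching
   PREFETCH_LINES cache lines ahead of the line being processed.
   __builtin_prefetch is a GCC/Clang builtin (toolchain assumption);
   arguments 0, 0 mean "read" and "no temporal locality". */
static float sum_with_prefetch(const float *data, size_t n)
{
    const size_t per_line = CACHE_LINE_BYTES / sizeof(float);
    const size_t ahead = per_line * PREFETCH_LINES;
    float sum = 0.0f;

    for (size_t i = 0; i < n; i += per_line) {
        /* Only form addresses inside the array. */
        if (i + ahead < n)
            __builtin_prefetch(&data[i + ahead], 0, 0);

        size_t end = (i + per_line < n) ? i + per_line : n;
        for (size_t j = i; j < end; j++)
            sum += data[j];   /* stand-in for the per-line computation */
    }
    return sum;
}
```

The distance of three lines is a starting point only. The best value depends on how much work the loop does per cache line and should be tuned by measurement.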
Figure 5. Processor-Limited Code
[Figure 4 diagram: a timeline with a row of memory cycles (M1 through M5) and a row of CPU loops (C0 through C4). Each memory burst time covers one 64-byte cache line. The total memory latency is marked alongside the instruction prefetchnta [esi + 64*4]. Prefetch distance is ~4 cache lines ahead.]
[Figure 5 diagram: a timeline with a row of CPU loops (C1 through C5) and a row of memory cycles (M1 through M5). Each memory burst time covers one 64-byte cache line, and each CPU time interval processes one cache line. The total memory latency is marked alongside the instruction prefetchnta [esi + 64*2]. Prefetch distance is ~2 cache lines ahead (maybe use 3 for safety).]