Intel IA-32 Computer Accessories User Manual


 
Optimizing Cache Usage 6
Hardware Prefetch
The automatic hardware prefetcher can bring cache lines into the unified
last-level cache based on prior data misses. It attempts to prefetch two
cache lines ahead of the prefetch stream. This feature was introduced
with the Pentium 4 processor.
The characteristics of hardware prefetching are as follows:
• Requires some regularity in the data access patterns:
  - If a data access pattern has a constant stride, hardware
    prefetching is effective only if the access stride is less than
    half of the trigger distance of the hardware prefetcher (see
    Table 1-2).
  - If the access stride is not constant, the automatic hardware
    prefetcher can mask memory latency if the strides of two
    successive cache misses are less than the trigger threshold
    distance (small-stride memory traffic).
  - The automatic hardware prefetcher is most effective if the
    strides of two successive cache misses remain less than the
    trigger threshold distance and close to 64 bytes.
• Incurs a start-up penalty before the hardware prefetcher triggers,
  and issues extra fetches after the array finishes. For short arrays,
  this overhead can reduce the effectiveness of the hardware
  prefetcher.
  - The hardware prefetcher requires a couple of misses before it
    starts operating.
  - Hardware prefetching will generate a request for data beyond
    the end of an array, and that data will not be used. This
    behavior wastes bus bandwidth. In addition, it results in a
    start-up penalty when fetching the beginning of the next array,
    because the wasted prefetch should have been used instead to
    hide the latency for the initial data in the next array.
    Software prefetching can recognize and handle these cases.
• Will not prefetch across a 4-KByte page boundary; that is, the
  program must initiate demand loads for the new page before the
  hardware prefetcher starts prefetching from that page.
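The constant-stride rule and the page-boundary restriction above can be
turned into simple arithmetic checks. The following is a minimal sketch,
not part of the manual: the helper names pages_crossed and
stride_is_prefetchable and the constant PREFETCH_PAGE_BYTES are
hypothetical, and the trigger distance is left as a parameter because
its value has to be taken from Table 1-2.

```c
#include <stddef.h>
#include <stdint.h>

/* 4-KByte page size, as stated in the text above. */
#define PREFETCH_PAGE_BYTES 4096u

/*
 * Hypothetical helper: number of 4-KByte page boundaries crossed by a
 * sequential walk over the byte range [base, base + len). At each
 * crossing the program has to issue demand loads in the new page
 * before the hardware prefetcher can resume.
 */
size_t pages_crossed(uintptr_t base, size_t len)
{
    if (len == 0) {
        return 0;
    }
    uintptr_t first_page = base / PREFETCH_PAGE_BYTES;
    uintptr_t last_page = (base + len - 1) / PREFETCH_PAGE_BYTES;
    return (size_t)(last_page - first_page);
}

/*
 * Hypothetical helper: constant-stride rule from the text. Returns 1
 * if the stride is less than half of the trigger distance, else 0.
 * trigger_distance is an assumption supplied by the caller (see
 * Table 1-2); no value is hard-coded here.
 */
int stride_is_prefetchable(size_t stride, size_t trigger_distance)
{
    /* Equivalent to 2 * stride < trigger_distance, without overflow. */
    return stride < trigger_distance &&
           stride < trigger_distance - stride;
}
```

For example, a 16-KByte array that starts on a page boundary gives
pages_crossed(0, 16384) == 3, so a sequential pass over it would see the
start-up penalty again at three points after the first page. A check
like this may help decide where software prefetching is worth adding.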