IA-32 Intel® Architecture Processor Family Overview
Hardware prefetching for the Pentium 4 processor has the following
characteristics:
• works with existing applications
• does not require extensive study of prefetch instructions
• requires regular access patterns
• avoids instruction and issue port bandwidth overhead
• has a start-up penalty before the hardware prefetcher triggers and
begins initiating fetches
The hardware prefetcher can handle multiple streams in either the
forward or backward direction. The start-up delay and fetch-ahead have
a larger effect on short arrays, where hardware prefetching generates
requests for data beyond the end of the array that are never used. This
penalty diminishes as it is amortized over longer arrays.
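The amortization argument above can be illustrated with a toy calculation. The sketch below is not a model of the actual prefetcher: the function name and the parameters `line_bytes` (fetch granularity) and `lines_ahead` (how many lines are requested past the end of the array) are hypothetical, chosen only to show how a fixed amount of over-fetch shrinks as a fraction of total fetches when the array grows.

```c
#include <stddef.h>

/*
 * Toy model (assumption, not processor behavior): the prefetcher fetches
 * every line of the array plus a fixed number of lines past its end.
 * Returns the fraction of all fetched lines that are never used.
 */
double wasted_fetch_fraction(size_t array_bytes, size_t line_bytes,
                             size_t lines_ahead)
{
    if (line_bytes == 0)
        return 0.0;  /* guard against division by zero */

    /* Lines that hold array data, rounded up to whole lines. */
    size_t useful_lines = (array_bytes + line_bytes - 1) / line_bytes;
    size_t total_lines = useful_lines + lines_ahead;

    if (total_lines == 0)
        return 0.0;

    return (double)lines_ahead / (double)total_lines;
}
```

With the illustrative values `line_bytes = 64` and `lines_ahead = 4`, a 256-byte array wastes half of its fetches, while a 7936-byte array wastes about 3 percent. The real penalty also includes the start-up delay, which this sketch does not model.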
Hardware prefetching is triggered after two successive cache misses in
the last level cache, provided that the linear address distance between
these misses is within a threshold value. The threshold value depends on
the processor implementation of the microarchitecture (see Table 1-2).
However, hardware prefetching will not cross 4KB page boundaries. As a
result, hardware prefetching can be very effective for cache miss
patterns with small strides that are significantly less than half the
threshold distance that triggers hardware prefetching. On the other
hand, hardware prefetching will not benefit cache miss patterns that
have frequent DTLB misses or access strides that place successive cache
misses farther apart than the trigger threshold distance.
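The two address conditions described above can be written down as simple predicates. This is a minimal sketch for reasoning about access patterns, not a description of the hardware logic: the function names are hypothetical, the threshold is passed as a parameter because its value is implementation-specific, and treating "within" as less-than-or-equal is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_BYTES 4096u  /* 4KB page size stated in the text */

/*
 * Distance condition: are two successive last-level cache misses close
 * enough, in linear address, to satisfy the trigger threshold?
 * The "<=" comparison is an assumption; the text only says "within".
 */
bool misses_within_threshold(uintptr_t first_miss, uintptr_t second_miss,
                             uintptr_t threshold_bytes)
{
    /* Absolute distance, since streams may run forward or backward. */
    uintptr_t distance = first_miss > second_miss
                             ? first_miss - second_miss
                             : second_miss - first_miss;
    return distance <= threshold_bytes;
}

/*
 * Page condition: do two linear addresses fall in the same 4KB page?
 * A prefetch target outside the page of the triggering access would
 * cross a page boundary, which the text says does not happen.
 */
bool same_4k_page(uintptr_t a, uintptr_t b)
{
    return (a / PAGE_BYTES) == (b / PAGE_BYTES);
}
```

For example, with a purely illustrative threshold of 128 bytes, misses at `0x1000` and `0x1040` satisfy the distance condition, while misses at `0x1000` and `0x1400` do not. Addresses `0x1fff` and `0x2000` are adjacent but lie in different 4KB pages.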
Software can proactively control its data access pattern to favor smaller
access strides (e.g., a stride less than half of the trigger threshold
distance) over larger access strides (a stride greater than the trigger
threshold distance). Doing so can bring the additional benefits of
improved temporal locality and significantly fewer cache misses in the
last level cache.
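One common way to favor smaller strides is to choose the loop order that matches the memory layout. The sketch below assumes a row-major matrix of `int` stored in a flat array; the function names are hypothetical. Both functions compute the same sum and differ only in the distance between consecutive accesses.

```c
#include <stddef.h>

/*
 * Small stride: the inner loop walks along a row, so consecutive
 * accesses are sizeof(int) bytes apart.
 */
long sum_small_stride(const int *m, size_t rows, size_t cols)
{
    long total = 0;
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];
    return total;
}

/*
 * Large stride: the inner loop walks down a column, so consecutive
 * accesses are cols * sizeof(int) bytes apart. For wide rows this
 * distance may exceed the trigger threshold.
 */
long sum_large_stride(const int *m, size_t rows, size_t cols)
{
    long total = 0;
    for (size_t c = 0; c < cols; ++c)
        for (size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}
```

Whether the second form actually defeats the hardware prefetcher depends on the row width and the implementation-specific threshold, so any speedup from interchanging the loops should be confirmed by measurement.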