IA-32 Intel® Architecture Optimization
• Optimize software prefetch scheduling distance:
— Far enough ahead to allow interim computation to overlap
memory access time.
— Near enough that the prefetched data is not replaced from the
data cache.
• Use software prefetch concatenation:
— Arrange prefetches to avoid unnecessary prefetches at the end
of an inner loop and to prefetch the first few iterations of the
inner loop inside the next outer loop.
• Minimize the number of software prefetches:
— Prefetch instructions are not completely free in terms of bus
cycles, machine cycles and resources; excessive usage of
prefetches can adversely impact application performance.
• Interleave prefetch with computation instructions:
— For best performance, software prefetch instructions must be
interspersed with other computational instructions in the
instruction sequence rather than clustered together.
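The guidelines on scheduling distance, prefetch count, and interleaving can be illustrated with a minimal C sketch. It rests on several assumptions that do not come from this manual: the GCC/Clang builtin `__builtin_prefetch` stands in for a PREFETCH instruction (on IA-32 targets with SSE it typically compiles to one), and the constants `CACHE_LINE_BYTES` and `PREFETCH_DISTANCE` as well as the function name `sum_with_prefetch` are hypothetical placeholders that would need tuning for a given processor. The sketch does not show prefetch concatenation across nested loops.

```c
#include <stddef.h>

/* Hypothetical tuning constants; not values taken from this manual. */
#define CACHE_LINE_BYTES  64  /* assumed cache line size, processor-specific */
#define PREFETCH_DISTANCE 8   /* scheduling distance in cache lines, must be tuned */

#define FLOATS_PER_LINE (CACHE_LINE_BYTES / sizeof(float))

/*
 * Sum an array while prefetching PREFETCH_DISTANCE cache lines ahead.
 * Exactly one prefetch is issued per cache line of data, and each
 * prefetch is followed by a full line's worth of computation, so the
 * prefetches are interleaved with work rather than clustered together.
 */
float sum_with_prefetch(const float *a, size_t n)
{
    float sum = 0.0f;
    size_t i = 0;

    /* Main loop: process one cache line of elements per iteration. */
    for (; i + FLOATS_PER_LINE <= n; i += FLOATS_PER_LINE) {
        size_t ahead = i + PREFETCH_DISTANCE * FLOATS_PER_LINE;

        /* Do not issue prefetches for addresses past the end of the array. */
        if (ahead < n)
            __builtin_prefetch(&a[ahead], 0 /* read */, 3 /* keep in all cache levels */);

        for (size_t j = 0; j < FLOATS_PER_LINE; j++)
            sum += a[i + j];
    }

    /* Remainder loop: fewer than one cache line of elements is left. */
    for (; i < n; i++)
        sum += a[i];

    return sum;
}
```

If `PREFETCH_DISTANCE` is too small, the data may not arrive before it is needed; if it is too large, the prefetched lines may be evicted before use. A suitable value is best found by measurement on the target processor.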
Hardware Prefetching of Data
The Pentium 4, Intel Xeon, Pentium M, Intel Core Solo and Intel Core
Duo processors implement an automatic hardware data prefetcher that
monitors application data access patterns and prefetches data
without requiring direct intervention from the programmer.
Characteristics of the hardware data prefetcher for the Pentium 4 and
Intel Xeon processors are:
1. Requires two successive cache misses in the last level cache to
trigger the mechanism; the stride between these two misses must be
less than the trigger distance of the hardware prefetch mechanism
(see Table 1-2).
2. Attempts to stay 256 bytes ahead of current data access locations