IA-32 Intel® Architecture Optimization
access patterns to suit the hardware prefetcher is highly recommended,
and should be a higher-priority consideration than using software
prefetch instructions.
The hardware prefetcher is best for small-stride data access patterns in
either direction with cache-miss stride not far from 64 bytes. This is true
for data accesses to addresses that are either known or unknown at the
time of issuing the load operations. Software prefetch can complement
the hardware prefetcher if used carefully.
Choosing between hardware and software prefetching involves a
trade-off that depends on application characteristics such as the
regularity and stride of accesses. Bus bandwidth, issue bandwidth (the
latency of loads on the critical path), and whether access patterns are
suitable for non-temporal prefetch will also have an impact.
For a detailed description of how to use prefetching, see Chapter 6,
“Optimizing Cache Usage”.
User/Source Coding Rule 9. (M impact, H generality) Enable prefetch
generation in your compiler. Note: As a compiler’s prefetch implementation
improves, its prefetch insertion is expected to outperform manual insertion,
except for that done by code tuning experts, but this is not always the
case. If the compiler does not support software prefetching, intrinsics or inline
assembly may be used to insert prefetch instructions manually.
Chapter 6 contains an example of using software prefetch to implement
a memory copy algorithm.
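As an illustration of manual insertion with intrinsics, the following is a minimal sketch, not a tuned implementation. It assumes a gather loop whose addresses are irregular and therefore a poor fit for the hardware prefetcher. The names gather_sum, PREFETCH_T0, and PREFETCH_DISTANCE are hypothetical, and the distance of 64 elements is an arbitrary placeholder that would need tuning on the target. The _mm_prefetch intrinsic and the _MM_HINT_T0 hint come from <xmmintrin.h>.

```c
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */
#define PREFETCH_T0(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#else
#define PREFETCH_T0(p) ((void)(p))   /* no-op on targets without SSE */
#endif

/* Hypothetical tuning parameter: how many elements ahead to prefetch. */
#define PREFETCH_DISTANCE 64

/* Sum values gathered through an index table. The addresses
   table[idx[i]] are irregular, so a software prefetch is issued
   PREFETCH_DISTANCE iterations before each value is needed. */
long gather_sum(const int *table, const size_t *idx, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Bounds check keeps the read of idx[] inside the array. */
        if (i + PREFETCH_DISTANCE < n)
            PREFETCH_T0(&table[idx[i + PREFETCH_DISTANCE]]);
        sum += table[idx[i]];
    }
    return sum;
}
```

The prefetch is only a hint and does not change the result, so the function returns the same sum with or without it. Whether it helps depends on the miss rate and on the chosen distance.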
Tuning Suggestion 2. If a load is found to miss frequently, either insert a
prefetch before it, or, if issue bandwidth is a concern, move the load up to
execute earlier.
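The second option in Tuning Suggestion 2, moving the load up, can be sketched at source level as follows. This is an assumption-laden illustration: the function name gather_sum_sq_hoisted is hypothetical, and a compiler or the out-of-order core may already reorder the load, so the generated code should be inspected before relying on it. The load for the next iteration is issued before the arithmetic on the current value, so no extra prefetch instruction is spent.

```c
#include <stddef.h>

/* Sum of squares of values gathered through an index table.
   The load of the next element is placed ahead of the work on the
   current element so that a cache miss can overlap with that work. */
long gather_sum_sq_hoisted(const int *table, const size_t *idx, size_t n)
{
    if (n == 0)
        return 0;

    long sum = 0;
    int cur = table[idx[0]];
    for (size_t i = 0; i + 1 < n; i++) {
        int next = table[idx[i + 1]];  /* load for the next iteration, issued early */
        sum += (long)cur * cur;        /* work on the current value */
        cur = next;
    }
    sum += (long)cur * cur;            /* last element has no successor to load */
    return sum;
}
```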
Cacheability Instructions
SSE2 provides additional cacheability instructions that extend the
cacheability instructions provided in SSE. The new cacheability
instructions include:
• new streaming store instructions