Intel IA-32 Computer Accessories User Manual


 
IA-32 Intel® Architecture Optimization
The performance loss caused by poor utilization of resources can be
completely eliminated by scheduling the prefetch instructions
appropriately. As shown in Figure 6-3, prefetch instructions are issued
two vertex iterations ahead. This assumes that only one vertex gets
processed in one iteration and a new data cache line is needed for each
iteration. As a result, when iteration n, vertex V_n, is being processed, the
requested data is already brought into cache. In the meantime, the
front-side bus is transferring the data needed for iteration n+1, vertex
V_{n+1}. Because there is no dependence between V_{n+1} data and the
execution of V_n, the latency for data access of V_{n+1} can be entirely
hidden behind the execution of V_n. Under such circumstances, no
“bubbles” are present in the pipelines and thus the best possible
performance can be achieved.
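The scheduling described above can be sketched in C as follows. This is a minimal illustration, not code from the manual: the Vertex layout, the 64-byte cache line it is padded to, the PREFETCH macro, the choice of the _MM_HINT_T0 hint, and the function name process_vertices are all assumptions made for the example. The loop issues a prefetch for vertex n+2 while vertex n is being processed, matching the two-iteration distance discussed in the text.

```c
#include <stddef.h>

/* Pick a prefetch primitive. _mm_prefetch is the SSE intrinsic;
   the fallbacks keep the sketch compilable on other targets. */
#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>
#define PREFETCH(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#elif defined(__GNUC__)
#define PREFETCH(p) __builtin_prefetch((p), 0, 3)
#else
#define PREFETCH(p) ((void)0)
#endif

/* Hypothetical vertex layout, padded to 64 bytes so that each vertex
   occupies exactly one cache line (the line size is an assumption). */
typedef struct {
    float x, y, z, w;
    float pad[12];
} Vertex;

/* Prefetch two vertex iterations ahead, as in the text. */
#define PREFETCH_DISTANCE 2

/* Stand-in for per-vertex work: sums the squared components. */
float process_vertices(const Vertex *v, size_t count)
{
    float sum = 0.0f;
    for (size_t n = 0; n < count; n++) {
        /* Request the line for vertex n+2 while vertex n is computed. */
        if (n + PREFETCH_DISTANCE < count)
            PREFETCH(&v[n + PREFETCH_DISTANCE]);

        sum += v[n].x * v[n].x + v[n].y * v[n].y
             + v[n].z * v[n].z + v[n].w * v[n].w;
    }
    return sum;
}
```

The bounds check keeps the prefetch address inside the array. A prefetch is only a hint and does not change the result, so the function returns the same value with or without it. Whether it runs faster depends on the platform and on how much computation each iteration performs.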
Prefetching is useful for inner loops that have heavy computations, or
are close to the boundary between being compute-bound and
memory-bandwidth-bound.
Prefetching is probably not very useful for loops that are
predominantly memory-bandwidth-bound.
When data is already located in the first-level cache, prefetching can be
useless and could even degrade performance because the extra
µops either back up waiting for outstanding memory accesses or may be
dropped altogether. This behavior is platform-specific and may change
in the future.
Software Prefetching Usage Checklist
The following checklist covers issues that need to be addressed and/or
resolved to use the software prefetch instruction properly:
• Determine software prefetch scheduling distance
• Use software prefetch concatenation
• Minimize the number of software prefetches
• Mix software prefetch with computation instructions
• Use cache blocking techniques (for example, strip mining)
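The last checklist item, strip mining, can be illustrated with a short C sketch. It is an assumption-laden example rather than code from the manual: the strip size STRIP_ELEMS and the functions transform, accumulate, and two_pass_strip_mined are hypothetical names chosen for illustration. Instead of running each pass over the whole array, both passes are applied to one strip at a time, so the second pass finds the strip's data still in cache.

```c
#include <stddef.h>

/* Assumed strip size in elements (4 KB of floats here); in practice it
   would be tuned so that one strip fits in the target cache level. */
#define STRIP_ELEMS 1024

/* First pass: scale every element of the strip in place. */
static void transform(float *data, size_t count, float scale)
{
    for (size_t i = 0; i < count; i++)
        data[i] *= scale;
}

/* Second pass: sum the elements of the strip. */
static float accumulate(const float *data, size_t count)
{
    float sum = 0.0f;
    for (size_t i = 0; i < count; i++)
        sum += data[i];
    return sum;
}

/* Run both passes strip by strip rather than array by array. */
float two_pass_strip_mined(float *data, size_t count, float scale)
{
    float total = 0.0f;
    for (size_t start = 0; start < count; start += STRIP_ELEMS) {
        size_t len = count - start;
        if (len > STRIP_ELEMS)
            len = STRIP_ELEMS;   /* the final strip may be shorter */

        transform(data + start, len, scale);
        total += accumulate(data + start, len);
    }
    return total;
}
```

The sketch shows only the loop restructuring. Software prefetches, if used, would be added inside the first pass so that each strip is fetched from memory once and then reused by the second pass.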