IA-32 Intel® Architecture Optimization
be no RFO since the line is not cached, and there is no such delay. For
details on write-combining, see the Intel Architecture Software
Developer’s Manual.
Locality Enhancement
Locality enhancement can reduce data traffic originating from an
outer-level sub-system in the cache/memory hierarchy. It addresses the
fact that an access to an outer level costs more cycles than an access
to an inner level. The cycle cost of accessing a given cache level (or
the memory system) varies across microarchitectures, processor
implementations, and platform components. It is often sufficient to
recognize the relative trend in data access cost by locality rather than
to consult a large table of cycle costs listed per locality and per
processor/platform implementation. The general trend is that an access
to an outer sub-system is roughly 3 to 10 times more expensive than an
access to the immediately inner level of the cache/memory hierarchy,
assuming similar degrees of data access parallelism.
Thus locality enhancement should start with characterizing the
dominant data traffic locality. “Workload Characterization” in
Appendix A describes some techniques that can be used to determine
the dominant data traffic locality for any workload.
Even when the miss rate of the last-level cache is low relative to the
number of cache references, processors typically spend a sizable
portion of their execution time waiting for cache misses to be serviced.
Reducing cache misses by enhancing a program’s locality is therefore a
key optimization. This can take several forms:
• blocking to iterate over a portion of an array that will fit in the
  cache (so that subsequent references to the data block, or tile, will
  be cache hits)
• loop interchange to avoid crossing cache lines or page boundaries
• loop skewing to make accesses contiguous
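
The first two forms can be illustrated with a minimal C sketch. It is not taken from the manual: the function names (sum_column_order, sum_row_order, matmul_blocked) and the block parameter are hypothetical, matrices are assumed to be stored row-major in flat arrays, and a suitable tile size has to be found by measurement on the target processor.

```c
#include <assert.h>
#include <stddef.h>

/* Before loop interchange: the inner loop walks down a column, so
 * consecutive accesses are cols * sizeof(double) bytes apart and may
 * touch a new cache line (or page) on every iteration. */
double sum_column_order(const double *a, size_t rows, size_t cols)
{
    double s = 0.0;
    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)
            s += a[i * cols + j];
    return s;
}

/* After loop interchange: the inner loop walks along a row, so
 * consecutive accesses are adjacent in memory. */
double sum_row_order(const double *a, size_t rows, size_t cols)
{
    double s = 0.0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            s += a[i * cols + j];
    return s;
}

/* Smaller of two sizes; used to clip tiles at the matrix edge. */
static size_t min_size(size_t x, size_t y)
{
    return x < y ? x : y;
}

/* Blocking: computes c += a * b for n x n matrices, one tile at a time.
 * The caller must zero c first to obtain the plain product. The tile
 * size `block` (hypothetical parameter) should be chosen so that the
 * tiles in use fit in the targeted cache level. */
void matmul_blocked(const double *a, const double *b, double *c,
                    size_t n, size_t block)
{
    assert(block > 0);
    for (size_t ii = 0; ii < n; ii += block)
        for (size_t kk = 0; kk < n; kk += block)
            for (size_t jj = 0; jj < n; jj += block) {
                size_t i_end = min_size(ii + block, n);
                size_t k_end = min_size(kk + block, n);
                size_t j_end = min_size(jj + block, n);
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        double aik = a[i * n + k];
                        /* The same rows of the b tile are revisited for
                         * every i in the tile, so they can stay cached. */
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
}
```

Both summation routines visit the same elements and differ only in traversal order. In matmul_blocked, the three outer loops select tiles and the three inner loops work within them, which is what allows repeated references to a tile to hit in the cache instead of being fetched again from an outer level.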