Optimizing Cache Usage 6
Table 6-1 summarizes the steps of the basic usage model that
incorporates only software prefetch with strip-mining. The steps are:
• Do strip-mining: partition loops so that the dataset fits into the
second-level cache.
• Use prefetchnta if the data is only used once or the dataset fits
into 32 KB (one way of the second-level cache). Use prefetcht0 if the
dataset exceeds 32 KB.
The above steps are platform-specific and provide an implementation
example. The variables NUM_STRIP and MAX_NUM_VX_PER_STRIP can be
determined heuristically to reach peak performance for a specific
application on a specific platform.
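The steps above can be sketched in C roughly as follows. This is a
minimal illustration under stated assumptions, not the manual's own
example: the two-pass kernel (scale, then sum), the names
process_strips and prefetch_vx, the 64-byte line size, and the
prefetch distance are hypothetical choices made for the sketch. Only
NUM_STRIP, MAX_NUM_VX_PER_STRIP, and the 32 KB threshold come from the
text, and the threshold is applied here to the per-strip working set,
which is one reading of the guideline. The _mm_prefetch intrinsic from
<xmmintrin.h> requests prefetchnta or prefetcht0 depending on the hint.

```c
#include <stddef.h>
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_NTA, _MM_HINT_T0 */

/* Tuning knobs. All values are illustrative assumptions and would be
   determined heuristically for a given application and platform. */
enum {
    MAX_NUM_VX_PER_STRIP = 1024,                       /* elements per strip */
    ONE_WAY_BYTES        = 32 * 1024,                  /* one L2 way, per the text */
    CACHE_LINE_BYTES     = 64,                         /* assumed line size */
    VX_PER_LINE          = CACHE_LINE_BYTES / sizeof(float),
    PREFETCH_AHEAD_VX    = 4 * VX_PER_LINE,            /* assumed prefetch distance */
    STRIP_BYTES          = MAX_NUM_VX_PER_STRIP * sizeof(float)
};

/* Pick the prefetch hint by the rule in the text: prefetchnta when the
   working set fits into 32 KB, prefetcht0 when it does not. The hint
   must be a compile-time constant, hence two separate calls. */
static inline void prefetch_vx(const float *p)
{
    if (STRIP_BYTES <= ONE_WAY_BYTES)
        _mm_prefetch((const char *)p, _MM_HINT_NTA);
    else
        _mm_prefetch((const char *)p, _MM_HINT_T0);
}

/* Hypothetical two-pass kernel: pass 1 scales each element in place,
   pass 2 sums the scaled elements. Both passes run strip by strip so
   that pass 2 re-reads data that pass 1 just brought into the cache. */
float process_strips(float *vx, size_t num_vx, float scale)
{
    /* NUM_STRIP: number of strips needed to cover the dataset. */
    size_t num_strip =
        (num_vx + MAX_NUM_VX_PER_STRIP - 1) / MAX_NUM_VX_PER_STRIP;
    float sum = 0.0f;

    for (size_t s = 0; s < num_strip; s++) {
        size_t begin = s * MAX_NUM_VX_PER_STRIP;
        size_t end = begin + MAX_NUM_VX_PER_STRIP;
        if (end > num_vx)
            end = num_vx;               /* last strip may be partial */

        /* Pass 1: the first touch of the strip pays the memory access
           cost; prefetch a few lines ahead, about once per cache line. */
        for (size_t i = begin; i < end; i++) {
            if (i % VX_PER_LINE == 0 && i + PREFETCH_AHEAD_VX < num_vx)
                prefetch_vx(&vx[i + PREFETCH_AHEAD_VX]);
            vx[i] *= scale;
        }

        /* Pass 2: re-reads the same strip, which should still be cached. */
        for (size_t i = begin; i < end; i++)
            sum += vx[i];
    }
    return sum;
}
```

For instance, calling process_strips(vx, 3000, 2.0f) on an array of
ones walks three strips (1024, 1024, and 952 elements) and returns
6000. Changing MAX_NUM_VX_PER_STRIP changes both the number of strips
and, once a strip exceeds 32 KB, the prefetch hint that is used.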
Hardware Prefetching and Cache Blocking Techniques
Tuning data access patterns for the automatic hardware prefetch
mechanism can minimize the memory access costs of the first pass of
read-multiple-times memory references and of some read-once memory
references. A matrix or image transpose illustrates read-once memory
references: it reads in column-first order and writes in row-first
order, or vice versa.
Table 6-1 Software Prefetching Considerations into Strip-mining Code

Read-Once Array        Read-Multiple-Times Array References
References             Adjacent Passes          Non-Adjacent Passes
---------------------  -----------------------  -----------------------
prefetchnta            prefetcht0, SM1          prefetcht0, SM1
                                                (2nd-level pollution)
Evict one way;         Pay memory access cost   Pay memory access cost
minimize pollution     for the first pass of    for the first pass of
                       each array; amortize     every strip; amortize
                       the first pass with      the first pass with
                       subsequent passes        subsequent passes

Example 6-9 shows a nested loop of data movement that represents a
typical matrix/image transpose problem. If the dimensions of the array
are large, not only will the footprint of the dataset exceed the
last-level cache, but cache misses will also occur at large strides.
If the dimensions