Intel IA-32 Computer Accessories User Manual

Optimizing Cache Usage 6

Table 6-1 summarizes the steps of the basic usage model that incorporates only software prefetch with strip-mining. The steps are:

• Do strip-mining: partition loops so that the dataset fits into second-level cache.

• Use prefetchnta if the data is only used once or the dataset fits into 32 KB (one way of second-level cache). Use prefetcht0 if the dataset exceeds 32 KB.

The above steps are platform-specific and provide an implementation example. The variables NUM_STRIP and MAX_NUM_VX_PER_STRIP can be determined heuristically to reach peak performance for a specific application on a specific platform.
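
The two steps above can be sketched in C as follows. This is a minimal illustration, not the manual's own example: the function names, the prefetch distance PF_DIST, and the assumption of a 64-byte cache line (16 floats) are all hypothetical and would need tuning per platform. The prefetch macros fall back to no-ops when SSE intrinsics are unavailable.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define PREFETCH_NTA(p) _mm_prefetch((const char *)(p), _MM_HINT_NTA)
#define PREFETCH_T0(p)  _mm_prefetch((const char *)(p), _MM_HINT_T0)
#else
/* No SSE available: the sketch still runs, just without prefetch hints. */
#define PREFETCH_NTA(p) ((void)(p))
#define PREFETCH_T0(p)  ((void)(p))
#endif

/* Hypothetical tuning values, not taken from the manual. */
enum { PF_DIST = 64, FLOATS_PER_LINE = 16 };

/* Read-once reference: each element is consumed a single time, so
   prefetchnta is used to limit second-level cache pollution. */
float sum_once(const float *a, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        /* Issue one prefetch per assumed cache line, staying in bounds. */
        if (i % FLOATS_PER_LINE == 0 && i + PF_DIST < n)
            PREFETCH_NTA(&a[i + PF_DIST]);
        sum += a[i];
    }
    return sum;
}

/* Read-multiple-times reference with adjacent passes: both passes run on
   one strip before moving to the next, so only the first pass of a strip
   pays the memory access cost. strip_len should be chosen so that a strip
   fits in the second-level cache. */
void two_pass_strips(float *a, size_t n, size_t strip_len)
{
    assert(strip_len > 0);
    for (size_t start = 0; start < n; start += strip_len) {
        size_t end = (n - start < strip_len) ? n : start + strip_len;

        /* Pass 1: prefetcht0 brings lines into all cache levels. */
        for (size_t i = start; i < end; i++) {
            if (i % FLOATS_PER_LINE == 0 && i + PF_DIST < n)
                PREFETCH_T0(&a[i + PF_DIST]);
            a[i] *= 2.0f;
        }
        /* Pass 2: the strip is expected to still be cached. */
        for (size_t i = start; i < end; i++)
            a[i] += 1.0f;
    }
}
```

In practice the strip length plays the role that NUM_STRIP and MAX_NUM_VX_PER_STRIP play in the manual's example: it is a tuning knob to be measured on the target platform rather than a fixed constant.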

Hardware Prefetching and Cache Blocking Techniques

Tuning data access patterns for the automatic hardware prefetch mechanism can minimize the memory access costs of the first pass of read-multiple-times references and of some read-once memory references. A matrix or image transpose is an example of read-once memory references: it reads from a column-first orientation and writes to a row-first orientation, or vice versa.
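
As a rough illustration of this access pattern (a sketch only, not the manual's Example 6-9), a straightforward out-of-place transpose in C touches every source element exactly once. Here the reads are sequential and the writes are strided; swapping the loop order would stride the reads instead. The function name and the row-major layout are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Writes the transpose of src (rows x cols, row-major) into dst
   (cols x rows, row-major). Each src element is read exactly once.
   Reads walk src with unit stride; writes walk dst with a stride of
   `rows` elements, which is where the large-stride misses come from. */
void transpose(const float *src, float *dst, size_t rows, size_t cols)
{
    assert(src != dst); /* out-of-place only */
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            dst[j * rows + i] = src[i * cols + j];
}
```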

Example 6-9 shows a nested loop of data movement that represents a typical matrix/image transpose problem. If the dimensions of the array are large, not only will the footprint of the dataset exceed the last-level cache, but cache misses will also occur at large strides. If the dimensions

Table 6-1 Software Prefetching Considerations into Strip-mining Code

Read-Once Array References   | Read-Multiple-Times Array References
                             | Adjacent Passes                | Non-Adjacent Passes
-----------------------------+--------------------------------+--------------------------------------
Prefetchnta                  | Prefetch0, SM1                 | Prefetch0, SM1 (2nd Level Pollution)
Evict one way; Minimize      | Pay memory access cost for     | Pay memory access cost for
pollution                    | the first pass of each array;  | the first pass of every strip;
                             | Amortize the first pass with   | Amortize the first pass with
                             | subsequent passes              | subsequent passes
