Developer’s Manual March, 2003 B-27
Intel® 80200 Processor based on Intel® XScale™ Microarchitecture
Optimization Guide
B.4.4.2. Prefetch Loop Scheduling
When adding prefetch to a loop that operates on arrays, it may be advantageous to prefetch ahead
one, two, or more iterations. The data for future iterations is located in memory at a fixed offset
from the data for the current iteration, which makes it easy to predict where to fetch the data. The
number of iterations to prefetch ahead is referred to as the prefetch scheduling distance, or psd. For
the Intel® 80200 processor this can be calculated as:
Where:
N_pref        Is the number of cache lines to be prefetched for both reading and writing.
N_evict       Is the number of cache half-line evictions caused by the loop.
N_inst        Is the number of instructions executed in one iteration of the loop.
N_hwlinexfer  Is the number of core clocks required to write half a cache line, as would happen if
              only one of the cache line dirty bits were set when a line eviction occurred. For the
              Intel® 80200 processor this takes 2 bus clocks or 12 core clocks.
CPI           Is the average number of core clocks per instruction.
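The psd calculation can be sketched in C as follows. This is a minimal illustration, not code from the manual: the struct and function names are hypothetical, and the meanings given for N_lookup and N_linexfer (cache lookup cost and full cache line transfer cost, in core clocks) are assumptions inferred from the symbol names, since this section does not define them.

```c
#include <assert.h>

/* Inputs to the psd equation. Clock counts are in core clocks. */
struct psd_params {
    double n_lookup;     /* N_lookup: assumed here to be the cache lookup cost */
    double n_linexfer;   /* N_linexfer: assumed here to be the cost of a full line transfer */
    double n_pref;       /* N_pref: cache lines prefetched per iteration */
    double n_hwlinexfer; /* N_hwlinexfer: half-line write cost (12 core clocks per the text) */
    double n_evict;      /* N_evict: half-line evictions per iteration */
    double cpi;          /* CPI: average core clocks per instruction */
    double n_inst;       /* N_inst: instructions per loop iteration */
};

/* Returns floor(memory clocks per iteration / compute clocks per iteration). */
unsigned prefetch_sched_distance(const struct psd_params *p)
{
    assert(p->cpi > 0.0 && p->n_inst > 0.0);
    double mem_clocks = p->n_lookup
                      + p->n_linexfer * p->n_pref
                      + p->n_hwlinexfer * p->n_evict;
    double iter_clocks = p->cpi * p->n_inst;
    /* All terms are non-negative, so truncation is the same as floor. */
    return (unsigned)(mem_clocks / iter_clocks);
}
```

With made-up illustrative inputs (N_lookup = 10, N_linexfer = 24, N_pref = 2, N_hwlinexfer = 12, N_evict = 1, CPI = 1.5, N_inst = 10), the numerator is 70 clocks and the denominator is 15 clocks, so the function returns 4.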
The psd value given by the psd equation is a good starting point, but it may not be optimal.
Estimating N_evict from static code is very difficult. However, if the operational data uses the
mini-data cache and the loop operations overflow the mini-data cache, then a first-order estimate
of N_evict is the number of bytes written per loop iteration divided by the half cache line size of
16 bytes. Cache overflow can be estimated from the number of cache lines transferred each iteration
and the expected number of loop iterations. N_evict and CPI can be estimated by profiling the code
using the performance monitor “cache write-back” event count.
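As a rough sketch of how these ideas might look in C, the block below shows the first-order N_evict estimate and a loop that prefetches psd iterations ahead. The function names are hypothetical, and `__builtin_prefetch` is a GCC/Clang compiler builtin rather than anything described in this manual; it is assumed here that the compiler lowers it to the processor's prefetch instruction. For simplicity, one loop iteration handles one array element; a tuned loop would more likely process a whole cache line per iteration.

```c
#include <assert.h>
#include <stddef.h>

#define HALF_LINE_BYTES 16.0 /* half cache line size given in the text */

/* First-order N_evict estimate; meaningful only when the loop
 * overflows the mini-data cache. */
double estimate_n_evict(size_t bytes_written_per_iter)
{
    return (double)bytes_written_per_iter / HALF_LINE_BYTES;
}

/* Sums an array while prefetching `psd` iterations ahead of the current one. */
long sum_with_prefetch(const int *data, size_t n, size_t psd)
{
    assert(data != NULL || n == 0);
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Stay in bounds: the final psd iterations have nothing left to prefetch. */
        if (psd < n - i)
            __builtin_prefetch(&data[i + psd], 0); /* 0 = prefetch for read */
        sum += data[i];
    }
    return sum;
}
```

The prefetch is only a hint, so the result of the loop is the same for any psd value; only the timing is expected to change, which is why the distance should be confirmed by profiling.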
B.4.4.3. Prefetch Loop Limitations
It is not always advantageous to add prefetch to a loop. Loop characteristics that limit the value
of prefetch are discussed below.
B.4.4.4. Compute vs. Data Bus Bound
At one extreme, a loop that is data bus bound does not benefit from prefetch, because all the
system resources for transferring data are quickly allocated and there are no instructions that can
profitably be executed. At the other end of the scale, compute-bound loops allow complete hiding
of all data transfer latencies.
B.4.4.5. Low Number of Iterations
Loops with very low iteration counts may see the advantages of prefetch completely negated. A
loop with a small fixed number of iterations may be faster if it is completely unrolled rather
than scheduled with prefetch instructions.
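To illustrate the unrolling alternative, the sketch below shows a four-element dot product written both as a loop and fully unrolled. The function names and the choice of a dot product are hypothetical, picked only for illustration; whether the unrolled form is actually faster on a given system would need to be measured.

```c
#include <assert.h>
#include <stddef.h>

/* Loop form: with only a few iterations, loop overhead and any
 * prefetch scheduling have little time to pay off. */
int dot_loop(const int *a, const int *b, size_t n)
{
    assert(a != NULL && b != NULL);
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Fully unrolled form for a fixed count of four: no loop counter,
 * no branch, and no prefetch instructions to schedule. */
int dot4_unrolled(const int a[4], const int b[4])
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}
```

Both functions compute the same result for four-element inputs; the unrolled version simply trades code size for the removal of loop control.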
Prefetch scheduling distance (psd) equation:

$$
\mathrm{psd} = \mathrm{floor}\left( \frac{N_{lookup} + N_{linexfer} \times N_{pref} + N_{hwlinexfer} \times N_{evict}}{CPI \times N_{inst}} \right)
$$