Mathematics of Prefetch Scheduling Distance E
T_b   data transfer latency, which is equal to number of lines
      per iteration * line burst latency
Note that the potential effects of µop reordering are not factored into the
estimations discussed.
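As a minimal sketch of the T_b definition above, the product can be written as a one-line helper. The function name and the sample values in the note that follows are hypothetical and chosen only for illustration; they are not measurements of any particular processor.

```c
#include <assert.h>

/* T_b = number of lines per iteration * line burst latency (in cycles).
   The name transfer_latency is an assumption made for this sketch. */
static unsigned transfer_latency(unsigned lines_per_iteration,
                                 unsigned line_burst_latency)
{
    assert(line_burst_latency > 0);  /* a zero burst latency is not meaningful */
    return lines_per_iteration * line_burst_latency;
}
```

For instance, with an assumed 2 lines per iteration and an assumed line burst latency of 8 cycles, transfer_latency(2, 8) gives a T_b of 16 cycles.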
Examine Example E-1, which uses the prefetchnta instruction with a
prefetch scheduling distance of 3, that is, psd = 3. The data prefetched in
iteration i will actually be used in iteration i+3. T_c represents the
cycles needed to execute top_loop, assuming all memory accesses hit L1,
while il (iteration latency) represents the cycles needed to execute this
loop with the actual run-time memory footprint. T_c can be determined by
computing the critical path latency of the code dependency graph. This
work is quite arduous without help from special performance
characterization tools or compilers. A simple heuristic for estimating the
T_c value is to count the number of instructions in the critical path and
multiply that count by an assumed CPI. A reasonable CPI value would be
somewhere between 1.0 and 1.5, depending on the quality of code
scheduling.
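The heuristic above amounts to a single multiplication. The following sketch shows it in C. The function name and the instruction counts in the note that follows are hypothetical; only the 1.0 to 1.5 CPI range comes from the text.

```c
#include <assert.h>

/* Heuristic T_c estimate: instructions on the critical path times an
   assumed CPI. The name estimate_tc is an assumption for this sketch. */
static double estimate_tc(unsigned critical_path_instructions, double cpi)
{
    assert(cpi > 0.0);  /* the text suggests a value between 1.0 and 1.5 */
    return critical_path_instructions * cpi;
}
```

Evaluating both ends of the CPI range brackets the estimate: for an assumed critical path of 8 instructions, estimate_tc(8, 1.0) and estimate_tc(8, 1.5) give 8 and 12 cycles respectively.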
Example E-1 Calculating Insertion for Scheduling Distance of 3
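top_loop:
    prefetchnta [edx+esi+32*3]      ; prefetch data for iteration i+3
    prefetchnta [edx*4+esi+32*3]    ; (32 bytes per iteration * psd of 3)
    . . . . .
    movaps xmm1, [edx+esi]          ; use data prefetched 3 iterations ago
    movaps xmm2, [edx*4+esi]
    movaps xmm3, [edx+esi+16]
    movaps xmm4, [edx*4+esi+16]
    . . . . .
    . . .
    add esi, 32                     ; advance 32 bytes per iteration
    cmp esi, ecx
    jl top_loop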
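For readers working in C, the loop structure of Example E-1 can be approximated with SSE intrinsics. This is a sketch under stated assumptions, not a translation of the elided original. The function name, the two separate arrays a and b (standing in for the [edx+esi] and [edx*4+esi] streams), and the summation used as the loop body are all hypothetical. The bounds check before the prefetch is also an addition, there only to keep the pointer arithmetic inside the arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_prefetch, _mm_load_ps, _mm_add_ps */

enum { PSD = 3, FLOATS_PER_ITER = 8 };  /* 8 floats = 32 bytes per stream */

/* Hypothetical loop body: sums two float streams.
   Assumes a and b are 16-byte aligned and n is a multiple of 8. */
static float sum_two_streams(const float *a, const float *b, size_t n)
{
    assert(n % FLOATS_PER_ITER == 0);
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += FLOATS_PER_ITER) {
        /* Data requested here is consumed PSD iterations later. */
        size_t ahead = i + PSD * FLOATS_PER_ITER;
        if (ahead < n) {  /* added guard: stay inside the arrays */
            _mm_prefetch((const char *)(a + ahead), _MM_HINT_NTA);
            _mm_prefetch((const char *)(b + ahead), _MM_HINT_NTA);
        }
        /* Four aligned 16-byte loads, mirroring the four movaps. */
        acc = _mm_add_ps(acc, _mm_load_ps(a + i));
        acc = _mm_add_ps(acc, _mm_load_ps(b + i));
        acc = _mm_add_ps(acc, _mm_load_ps(a + i + 4));
        acc = _mm_add_ps(acc, _mm_load_ps(b + i + 4));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);  /* horizontal sum of the four lanes */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

The prefetch offset of PSD * FLOATS_PER_ITER floats equals 96 bytes, matching the 32*3 displacement in the assembly listing. The guard adds a branch that the assembly version does not have, so the compiled loop may be scheduled differently from Example E-1.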