lines of data per iteration. The PSD would need to be increased if more than two
cache lines are used per iteration, and decreased if fewer are used.
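To make the relationship concrete, the sketch below (plain C using the _mm_prefetch
intrinsic; the kernel, constants, and names are illustrative assumptions, not taken
from this manual) shows how the prefetch offset follows from the PSD and the number
of cache lines consumed per iteration: changing LINES_PER_ITER changes the offset,
and the PSD itself may also need retuning.

    #include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_NTA */

    #define CACHE_LINE       64                            /* assumed line size              */
    #define LINES_PER_ITER   2                             /* cache lines used per iteration */
    #define BYTES_PER_ITER   (LINES_PER_ITER * CACHE_LINE)
    #define FLOATS_PER_ITER  (BYTES_PER_ITER / sizeof(float))
    #define PSD              3                             /* prefetch scheduling distance   */

    /* Illustrative streaming kernel: each iteration consumes BYTES_PER_ITER
       bytes and prefetches the data that will be needed PSD iterations later. */
    void scale_stream(const float *src, float *dst, size_t count)
    {
        for (size_t i = 0; i + FLOATS_PER_ITER <= count; i += FLOATS_PER_ITER) {
            _mm_prefetch((const char *)(src + i) + PSD * BYTES_PER_ITER,
                         _MM_HINT_NTA);
            for (size_t j = 0; j < FLOATS_PER_ITER; j++)    /* loop body */
                dst[i + j] = src[i + j] * 2.0f;
        }
    }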
Software Prefetch Concatenation
Maximum performance can be achieved when the execution pipeline is at
maximum throughput, without incurring any memory latency penalties.
This can be achieved by prefetching data to be used in successive
iterations of a loop. When memory accesses are de-pipelined, bubbles
appear in the execution pipeline. To illustrate this performance issue,
consider a 3D geometry pipeline that processes 3D vertices in strip format.
A strip contains a list of vertices whose predefined order forms
contiguous triangles. With an ineffective prefetch arrangement, the memory
pipe is de-pipelined at each strip boundary, and the execution pipeline
stalls for the first two iterations of each strip. As a result, the average
latency for completing an iteration is 165 clocks. (See Appendix E,
“Mathematics of Prefetch Scheduling Distance”, for a detailed memory
pipeline description.)
Example 6-3 Prefetch Scheduling Distance
top_loop:
    prefetchnta [edx + esi + 128*3]     ; prefetch first stream, PSD = 3 iterations (3*128 bytes) ahead
    prefetchnta [edx*4 + esi + 128*3]   ; prefetch second stream the same distance ahead
    . . . . .
    movaps xmm1, [edx + esi]            ; load the data for the current iteration
    movaps xmm2, [edx*4 + esi]
    movaps xmm3, [edx + esi + 16]
    movaps xmm4, [edx*4 + esi + 16]
    . . . . .
    . . . . .
    add esi, 128                        ; advance 128 bytes (two cache lines) per iteration
    cmp esi, ecx
    jl top_loop                         ; continue while esi < ecx
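The concatenation technique itself is not shown above; the sketch below gives the
idea in plain C (the strip layout, vertex type, constants, and names are assumptions
for illustration only). Near the end of a strip the prefetch target is redirected to
the start of the next strip, so the memory pipeline is not drained at the strip
boundary:

    #include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_NTA */

    #define VERTS_PER_ITER  2      /* vertices consumed per iteration (assumed) */
    #define PSD             3      /* prefetch scheduling distance (assumed)    */

    typedef struct { float x, y, z, w; } Vertex;   /* 16-byte vertex, assumed layout */

    /* Illustrative strip processor: the prefetch always reaches PSD iterations
       ahead; when that runs past the end of the current strip it spills into
       the next strip instead of being wasted, keeping the memory pipe full.   */
    void process_strips(Vertex *const *strips, const int *lengths, int nstrips)
    {
        for (int s = 0; s < nstrips; s++) {
            const Vertex *cur  = strips[s];
            const Vertex *next = (s + 1 < nstrips) ? strips[s + 1] : strips[s];
            int len = lengths[s];

            for (int i = 0; i < len; i += VERTS_PER_ITER) {
                int ahead = i + PSD * VERTS_PER_ITER;
                const Vertex *pf = (ahead < len)
                                 ? cur  + ahead            /* still inside this strip   */
                                 : next + (ahead - len);   /* spill into the next strip */
                _mm_prefetch((const char *)pf, _MM_HINT_NTA);

                /* ... transform cur[i] .. cur[i + VERTS_PER_ITER - 1] here ... */
            }
        }
    }

Because prefetch instructions are hints and do not fault, redirecting (or
overshooting) the prefetch address in this way is safe; a useless prefetch costs
only the bus bandwidth it consumes.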