lines of data per iteration. The PSD would need to be increased if more than two
cache lines are used per iteration, and decreased if fewer are used.
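To make the relationship concrete, the sketch below (plain C using the _mm_prefetch
intrinsic; the kernel, constants, and names are illustrative assumptions, not taken
from this manual) shows how the prefetch offset follows from the PSD and the number
of cache lines consumed per iteration: changing LINES_PER_ITER changes the offset,
and the PSD itself may also need retuning.

    #include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_NTA */

    #define CACHE_LINE       64                            /* assumed line size              */
    #define LINES_PER_ITER   2                             /* cache lines used per iteration */
    #define BYTES_PER_ITER   (LINES_PER_ITER * CACHE_LINE)
    #define FLOATS_PER_ITER  (BYTES_PER_ITER / sizeof(float))
    #define PSD              3                             /* prefetch scheduling distance   */

    /* Illustrative streaming kernel: each iteration consumes BYTES_PER_ITER
       bytes and prefetches the data that will be needed PSD iterations later. */
    void scale_stream(const float *src, float *dst, size_t count)
    {
        for (size_t i = 0; i + FLOATS_PER_ITER <= count; i += FLOATS_PER_ITER) {
            _mm_prefetch((const char *)(src + i) + PSD * BYTES_PER_ITER,
                         _MM_HINT_NTA);
            for (size_t j = 0; j < FLOATS_PER_ITER; j++)    /* loop body */
                dst[i + j] = src[i + j] * 2.0f;
        }
    }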
Software Prefetch Concatenation
Maximum performance can be achieved when the execution pipeline is at
maximum throughput, without incurring any memory latency penalties.
This can be achieved by prefetching data to be used in successive
iterations of a loop. When memory accesses are de-pipelined, bubbles
appear in the execution pipeline. To illustrate this performance issue,
consider a 3D geometry pipeline that processes 3D vertices in strip format.
A strip contains a list of vertices whose predefined order forms
contiguous triangles. With an ineffective prefetch arrangement, the memory
pipe is de-pipelined at each strip boundary, and the execution pipeline
stalls for the first two iterations of each strip. As a result, the average
latency for completing an iteration is 165 clocks. (See Appendix E,
“Mathematics of Prefetch Scheduling Distance”, for a detailed memory
pipeline description.)
Example 6-3 Prefetch Scheduling Distance
top_loop:
    prefetchnta [edx + esi + 128*3]     ; prefetch first stream, PSD = 3 iterations (3*128 bytes) ahead
    prefetchnta [edx*4 + esi + 128*3]   ; prefetch second stream the same distance ahead
    . . . . .
    movaps xmm1, [edx + esi]            ; load the data for the current iteration
    movaps xmm2, [edx*4 + esi]
    movaps xmm3, [edx + esi + 16]
    movaps xmm4, [edx*4 + esi + 16]
    . . . . .
    . . . . .
    add esi, 128                        ; advance 128 bytes (two cache lines) per iteration
    cmp esi, ecx
    jl top_loop                         ; continue while esi < ecx
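The concatenation technique itself is not shown above; the sketch below gives the
idea in plain C (the strip layout, vertex type, constants, and names are assumptions
for illustration only). Near the end of a strip the prefetch target is redirected to
the start of the next strip, so the memory pipeline is not drained at the strip
boundary:

    #include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_NTA */

    #define VERTS_PER_ITER  2      /* vertices consumed per iteration (assumed) */
    #define PSD             3      /* prefetch scheduling distance (assumed)    */

    typedef struct { float x, y, z, w; } Vertex;   /* 16-byte vertex, assumed layout */

    /* Illustrative strip processor: the prefetch always reaches PSD iterations
       ahead; when that runs past the end of the current strip it spills into
       the next strip instead of being wasted, keeping the memory pipe full.   */
    void process_strips(Vertex *const *strips, const int *lengths, int nstrips)
    {
        for (int s = 0; s < nstrips; s++) {
            const Vertex *cur  = strips[s];
            const Vertex *next = (s + 1 < nstrips) ? strips[s + 1] : strips[s];
            int len = lengths[s];

            for (int i = 0; i < len; i += VERTS_PER_ITER) {
                int ahead = i + PSD * VERTS_PER_ITER;
                const Vertex *pf = (ahead < len)
                                 ? cur  + ahead            /* still inside this strip   */
                                 : next + (ahead - len);   /* spill into the next strip */
                _mm_prefetch((const char *)pf, _MM_HINT_NTA);

                /* ... transform cur[i] .. cur[i + VERTS_PER_ITER - 1] here ... */
            }
        }
    }

Because prefetch instructions are hints and do not fault, redirecting (or
overshooting) the prefetch address in this way is safe; a useless prefetch costs
only the bus bandwidth it consumes.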