IA-32 Intel® Architecture Optimization
6-22
Example of Latency Hiding with S/W Prefetch Instruction
Achieving the highest level of memory optimization using prefetch
instructions requires an understanding of the microarchitecture and
system architecture of a given machine. This section translates the key
architectural implications into several simple guidelines for
programmers to use.
Figure 6-2 and Figure 6-3 show two scenarios of a simplified 3D
geometry pipeline as an example. A 3D-geometry pipeline typically
fetches one vertex record at a time and then performs transformation
and lighting functions on it. Both figures show two separate pipelines,
an execution pipeline, and a memory pipeline (front-side bus).
Since the Pentium 4 processor, similarly to the Pentium II and
Pentium III processors, completely decouples the functionality of
execution and memory access, these two pipelines can function
concurrently. Figure 6-2 shows “bubbles” in both the execution and
memory pipelines. When loads are issued for accessing vertex data, the
Figure 6-1 Effective Latency Reduction as a Function of Access Stride
Upperbound of Pointer-Chasing Latency Reduction
0%
20%
40%
60%
80%
100%
120%
64
8
0
96
1
12
128
1
44
1
60
176
1
92
2
08
2
24
240
Stride (Bytes)
Effective Latency Reduction
Fam.15; Model 3, 4
Fam.15; Model 0,1,2
Fam. 6; Model 13
Fam. 6; Model 14
Fam. 15; Model 6