Intel IA-32 Computer Accessories User Manual

IA-32 Intel® Architecture Optimization
Prefetch concatenation can bridge the execution pipeline bubbles at the
boundary between an inner loop and its associated outer loop. Unrolling
the last iteration out of the inner loop and giving it the effective
prefetch address of the data used in the following iteration (the first
element of the next row) can completely remove the performance loss of
memory de-pipelining. Example 6-5 gives the rewritten code.

With this change, only the first iteration of the outer loop suffers any
memory access latency penalty, assuming the computation time is larger
than the memory latency. Inserting a prefetch of the first data element
before entering the nested loop would eliminate or reduce that start-up
penalty for the very first iteration of the outer loop. This simple
high-level code optimization can improve memory performance
significantly.

Example 6-4 Using Prefetch Concatenation
for (ii = 0; ii < 100; ii++) {
    for (jj = 0; jj < 32; jj+=8) {
        prefetch a[ii][jj+8]
        computation a[ii][jj]
    }
}

Example 6-5 Concatenation and Unrolling the Last Iteration of Inner Loop
for (ii = 0; ii < 100; ii++) {
    for (jj = 0; jj < 24; jj+=8) {    /* N-1 iterations */
        prefetch a[ii][jj+8]
        computation a[ii][jj]
    }
    prefetch a[ii+1][0]
    computation a[ii][jj]             /* Last iteration */
}
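
The pseudocode of Example 6-5 might be written in C roughly as follows. This is a minimal sketch, not the manual's own code: it assumes a row-major array of 100 x 32 floats passed as a flat pointer, uses the `_mm_prefetch` intrinsic with the `_MM_HINT_T0` hint from `<xmmintrin.h>` (the choice of hint is an assumption), and stands in a simple summation for "computation". The names `sum_with_prefetch`, `process_chunk`, `ROWS`, `COLS` and `STEP` are hypothetical. It also includes the start-up prefetch issued before the nested loop that the text describes.

```c
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

enum { ROWS = 100, COLS = 32, STEP = 8 };  /* sizes taken from Example 6-5 */

/* Hypothetical stand-in for "computation": sum one STEP-element chunk. */
static float process_chunk(const float *chunk)
{
    float sum = 0.0f;
    for (int k = 0; k < STEP; k++)
        sum += chunk[k];
    return sum;
}

/* a points to ROWS * COLS floats stored row by row. */
float sum_with_prefetch(const float *a)
{
    float total = 0.0f;

    /* Start-up prefetch: request the first chunk before the nested loop. */
    _mm_prefetch((const char *)a, _MM_HINT_T0);

    for (int ii = 0; ii < ROWS; ii++) {
        const float *row = a + ii * COLS;
        int jj;

        /* N-1 iterations: prefetch the next chunk of the same row. */
        for (jj = 0; jj < COLS - STEP; jj += STEP) {
            _mm_prefetch((const char *)(row + jj + STEP), _MM_HINT_T0);
            total += process_chunk(row + jj);
        }

        /* Last iteration, unrolled out of the inner loop: prefetch the
         * first chunk of the next row instead. For the final row this
         * address is one past the end of the array; the pointer is never
         * dereferenced, and a prefetch is only a hint. */
        _mm_prefetch((const char *)(row + COLS), _MM_HINT_T0);
        total += process_chunk(row + jj);
    }
    return total;
}
```

Calling `sum_with_prefetch` on an array filled with 1.0f should return 3200.0f, the same result as a loop without prefetches, since the prefetches only affect timing. Whether they help in practice depends on the computation time per chunk relative to the memory latency, as noted above.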