IA-32 Intel® Architecture Optimization

Prefetch concatenation can bridge the execution pipeline bubbles at the
boundary between an inner loop and its associated outer loop. Simply by
unrolling the last iteration out of the inner loop and specifying the
effective prefetch address for data used in the following iteration, the
performance loss of memory de-pipelining can be completely removed.
Example 6-5 gives the rewritten code.

In the rewritten code segment, data prefetching is improved: only the
first iteration of the outer loop suffers any memory access latency
penalty, assuming the computation time is larger than the memory latency.
Inserting a prefetch of the first data element needed before entering the
nested loop computation would eliminate or reduce the start-up penalty
for the very first iteration of the outer loop. This uncomplicated
high-level code optimization can improve memory performance
significantly.
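
The claim about which iterations pay a latency penalty can be checked with a small counting model. The sketch below is illustrative only and is not taken from the manual: it replays the loop structures of Examples 6-4 and 6-5 and counts how many data blocks are used without an earlier prefetch. The function names and the `covered` table are hypothetical. The model assumes that each row is a separate region of memory (so a prefetch of a[ii][32] does not cover a[ii+1][0]) and that a prefetch issued one iteration ahead always completes in time, which corresponds to the text's assumption that computation time exceeds memory latency.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define ROWS 100  /* outer-loop trip count, as in the examples */
#define COLS 32   /* elements per row */
#define STEP 8    /* elements consumed per inner iteration */

/* Model assumption: block (ii, jj) counts as covered once any earlier
   iteration has issued a prefetch for it. Timing is not modeled. */
static bool covered[ROWS + 1][COLS + STEP];

static void prefetch(int ii, int jj) {
    assert(ii <= ROWS && jj < COLS + STEP);
    covered[ii][jj] = true;
}

/* Returns 1 when the block is used without a prior prefetch. */
static int computation(int ii, int jj) {
    return covered[ii][jj] ? 0 : 1;
}

/* Loop structure of Example 6-4. */
int unprefetched_concatenation(void) {
    int count = 0;
    memset(covered, 0, sizeof covered);
    for (int ii = 0; ii < ROWS; ii++) {
        for (int jj = 0; jj < COLS; jj += STEP) {
            prefetch(ii, jj + STEP);
            count += computation(ii, jj);
        }
    }
    return count;
}

/* Loop structure of Example 6-5. A nonzero startup adds the prefetch
   of the first data element before the loop nest is entered. */
int unprefetched_unrolled(int startup) {
    int count = 0;
    int jj;
    memset(covered, 0, sizeof covered);
    if (startup)
        prefetch(0, 0);
    for (int ii = 0; ii < ROWS; ii++) {
        for (jj = 0; jj < COLS - STEP; jj += STEP) { /* N-1 iterations */
            prefetch(ii, jj + STEP);
            count += computation(ii, jj);
        }
        prefetch(ii + 1, 0);
        count += computation(ii, jj);                /* last iteration */
    }
    return count;
}
```

Under these assumptions, `unprefetched_concatenation()` returns 100 (the first block of every row is used without a prior prefetch), `unprefetched_unrolled(0)` returns 1 (only the very first block), and `unprefetched_unrolled(1)` returns 0.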

Example 6-4 Using Prefetch Concatenation

for (ii = 0; ii < 100; ii++) {
    for (jj = 0; jj < 32; jj+=8) {
        prefetch a[ii][jj+8]
        computation a[ii][jj]
    }
}

Example 6-5 Concatenation and Unrolling the Last Iteration of Inner Loop

for (ii = 0; ii < 100; ii++) {
    for (jj = 0; jj < 24; jj+=8) { /* N-1 iterations */
        prefetch a[ii][jj+8]
        computation a[ii][jj]
    }
    prefetch a[ii+1][0]
    computation a[ii][jj] /* Last iteration */
}
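
For readers who want to try the pattern of Example 6-5 in compilable form, the following is a minimal sketch rather than the manual's own code. It assumes a GCC or Clang compiler and uses the `__builtin_prefetch` builtin as the prefetch hint. The `PREFETCH` macro, `process_block`, and `sum_rows` are hypothetical names introduced here, and summing floats merely stands in for "computation". Rows are passed as separate pointers, so the prefetch of the next row is guarded on the final outer iteration to avoid reading past the pointer array. The sketch also includes the start-up prefetch of the first data element described in the text.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 100
#define COLS 32
#define STEP 8   /* elements processed per inner iteration */

/* Assumption: GCC/Clang builtin used as the prefetch hint
   (read access, high temporal locality). */
#define PREFETCH(p) __builtin_prefetch((p), 0, 3)

/* Hypothetical stand-in for "computation": sums one block of STEP floats. */
static float process_block(const float *block) {
    float s = 0.0f;
    for (int k = 0; k < STEP; k++)
        s += block[k];
    return s;
}

/* Sums ROWS rows of COLS floats using the loop structure of Example 6-5. */
float sum_rows(float *const rows[ROWS]) {
    float total = 0.0f;
    assert(rows != NULL);

    PREFETCH(&rows[0][0]);               /* start-up prefetch before the loop nest */
    for (int ii = 0; ii < ROWS; ii++) {
        int jj;
        for (jj = 0; jj < COLS - STEP; jj += STEP) {  /* N-1 iterations */
            PREFETCH(&rows[ii][jj + STEP]);
            total += process_block(&rows[ii][jj]);
        }
        if (ii + 1 < ROWS)
            PREFETCH(&rows[ii + 1][0]);  /* first block of the next row */
        total += process_block(&rows[ii][jj]);        /* last iteration */
    }
    return total;
}
```

Prefetch hints do not change program results, so the function returns the same sum with or without them; any benefit shows up only in timing and depends on the processor, the data layout, and how the computation time compares with the memory latency.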
