Intel® 80200 Processor based on Intel® XScale™ Microarchitecture
Developer’s Manual, March 2003, B-33
Optimization Guide
B.4.4.11. Loop Interchange
As mentioned earlier, the sequence in which data is accessed affects cache thrashing. Usually, it is
best to access data in a spatially contiguous address range. However, arrays of data may have been
laid out such that consecutively indexed elements are not physically next to each other. Consider the
following C code, which stores array elements in row-major order but steps through the first index
in the inner loop.
for(j=0; j<NMAX; j++)
    for(i=0; i<NMAX; i++)
    {
        prefetch(A[i+1][j]);
        sum += A[i][j];
    }
In the above example, A[i][j] and A[i+1][j] are not adjacent in memory: in row-major order they lie
one full row (NMAX elements) apart. This increases bus traffic when prefetching loop data, because
successive iterations generally touch different cache lines. In some cases where the loop
mathematics are unaffected, the problem can be resolved by induction variable interchange. The
above example becomes:
for(i=0; i<NMAX; i++)
    for(j=0; j<NMAX; j++)
    {
        prefetch(A[i][j+1]);
        sum += A[i][j];
    }
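The interchange above can be sketched as a compilable fragment. This is only an illustration under stated assumptions: the array size NMAX, the element type int, the helper names, and the PREFETCH macro are all hypothetical and do not come from this manual. PREFETCH maps to the GCC/Clang builtin __builtin_prefetch where available and to a no-op elsewhere, and the bounds checks keep the prefetch address inside the array.

```c
#include <stddef.h>

#define NMAX 64 /* illustrative size; the manual does not specify one */

/* Hypothetical stand-in for the manual's prefetch(): uses the GCC/Clang
   builtin when available, otherwise evaluates to a no-op. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(addr) __builtin_prefetch((const void *)(addr))
#else
#define PREFETCH(addr) ((void)(addr))
#endif

/* Fill A so that A[i][j] == i * NMAX + j (test data only). */
void fill_sequential(int A[NMAX][NMAX])
{
    for (size_t i = 0; i < NMAX; i++)
        for (size_t j = 0; j < NMAX; j++)
            A[i][j] = (int)(i * NMAX + j);
}

/* Original order: the inner loop steps through the first index, so
   consecutive accesses are NMAX * sizeof(int) bytes apart. */
long sum_column_order(int A[NMAX][NMAX])
{
    long sum = 0;
    for (size_t j = 0; j < NMAX; j++)
        for (size_t i = 0; i < NMAX; i++) {
            if (i + 1 < NMAX)
                PREFETCH(&A[i + 1][j]);
            sum += A[i][j];
        }
    return sum;
}

/* Interchanged order: the inner loop steps through the second index, so
   consecutive accesses are sizeof(int) bytes apart. */
long sum_row_order(int A[NMAX][NMAX])
{
    long sum = 0;
    for (size_t i = 0; i < NMAX; i++)
        for (size_t j = 0; j < NMAX; j++) {
            if (j + 1 < NMAX)
                PREFETCH(&A[i][j + 1]);
            sum += A[i][j];
        }
    return sum;
}
```

Both functions return the same sum because integer addition does not depend on the order of the operands; only the memory access pattern changes. Any measured speedup depends on the array size relative to the cache and should be checked on the target hardware.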
B.4.4.12. Loop Fusion
Loop fusion is the process of combining multiple loops that reuse the same data into one loop.
The advantage is that the reused data is immediately accessible from the data cache.
Consider the following example:
for(i=0; i<NMAX; i++)
{
    prefetch(A[i+1], b[i+1], c[i+1]);
    A[i] = b[i] + c[i];
}
for(i=0; i<NMAX; i++)
{
    prefetch(D[i+1], c[i+1], A[i+1]);
    D[i] = A[i] + c[i];
}
The second loop reuses the data elements A[i] and c[i]. Fusing the loops together produces:
for(i=0; i<NMAX; i++)
{
    prefetch(D[i+1], A[i+1], c[i+1], b[i+1]);
    ai = b[i] + c[i];
    A[i] = ai;
    D[i] = ai + c[i];
}
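The fusion above can be checked with a small compilable sketch. The function names, the int element type, and the explicit length parameter n are assumptions made for illustration and are not taken from this manual. The prefetch calls are omitted to keep the focus on the fusion itself, and the arrays are assumed not to overlap.

```c
#include <stddef.h>

/* Two separate loops: the second loop re-reads A[i] and c[i], which may
   already have been evicted from the data cache if n is large. */
void add_separate(size_t n, const int *b, const int *c, int *A, int *D)
{
    for (size_t i = 0; i < n; i++)
        A[i] = b[i] + c[i];
    for (size_t i = 0; i < n; i++)
        D[i] = A[i] + c[i];
}

/* Fused loop: A[i] is kept in the local ai and c[i] is reused within the
   same iteration, so neither has to be fetched a second time.
   Assumes A, D, b and c do not alias one another. */
void add_fused(size_t n, const int *b, const int *c, int *A, int *D)
{
    for (size_t i = 0; i < n; i++) {
        int ai = b[i] + c[i];
        A[i] = ai;
        D[i] = ai + c[i];
    }
}
```

Both functions produce identical contents in A and D for non-overlapping arrays. If the arrays could alias, the two versions may differ, so fusion is only safe when the loop bodies have no such cross-iteration dependence.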